Data Visualization Exercise

-1, How many rows are in penguins? How many columns?

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

#A: The penguins data frame has 344 rows and 8 columns.
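A quick way to confirm the dimensions directly, a minimal sketch assuming the palmerpenguins package is installed:

```r
library(palmerpenguins)

nrow(penguins)  # 344 rows
ncol(penguins)  # 8 columns
dim(penguins)   # both at once: 344 8
```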

-2, What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.

#A: It is a number denoting bill depth (millimeters)

-3, Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

#A: The scatterplot does not suggest a strong overall relationship between bill_depth_mm and bill_length_mm; the points appear widely scattered rather than following a clear trend.

-4, What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

ggplot(
  data = penguins,
  mapping = aes(x = species, y = bill_depth_mm)
) +
  geom_point()

#A: The points pile up in three vertical strips, one per species, so individual values overlap and the distributions are hard to read. A better choice would be a geom that summarizes the distribution of bill_depth_mm within each species, such as geom_boxplot().
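A minimal sketch of the boxplot alternative, assuming penguins is loaded:

```r
ggplot(
  data = penguins,
  mapping = aes(x = species, y = bill_depth_mm)
) +
  geom_boxplot()
```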

-5, Why does the following give an error and how would you fix it?

ggplot(data = penguins) + geom_point()

#A: The code maps no variables to the x and y aesthetics, so geom_point() has nothing to plot. Fix it by adding mapping = aes(...) with variables for x and y, e.g. ggplot(data = penguins, mapping = aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point().

-6, What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

#"na.rm" will remove missing values in the target dataset(if coded as "NA").
ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point(na.rm = TRUE)

-7, Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

#Use the caption argument of labs() to add the caption.
ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point(na.rm = TRUE) +
  labs(
    title = "Bill length and depth of penguins",
    x = "Bill length (mm)", y = "Bill depth (mm)",
    caption = "Data come from the palmerpenguins package."
  )

-8, Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

ggplot(data = penguins, 
            mapping = aes(x = flipper_length_mm, y = body_mass_g)) + 
       geom_point(aes(color = bill_depth_mm)) + 
       geom_smooth() 

#bill_depth_mm should be mapped to the color aesthetic, and at the geom level: geom_point(aes(color = bill_depth_mm)). If it were mapped globally in ggplot(), geom_smooth() would inherit it too; mapping it only in geom_point() colors the points while geom_smooth() still draws a single smooth line for all the data.

-9. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)

#A: It will plot body_mass_g vs. flipper_length_mm with points colored by island, and because color is mapped globally, geom_smooth() will also draw three separate smooth lines (without confidence bands), one per island.

-10. Will these two graphs look different? Why/why not?

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )

#A: No, the two graphs will look the same, because both use the same data and the same mappings. The only difference is where the data and mappings are specified: globally in ggplot(), where every geom inherits them, or locally and identically in each geom.

2.4.3 Exercise

-1, Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

#The plot is "flipped": the bars run horizontally, with the species on the y-axis and the counts on the x-axis.
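A minimal sketch of the flipped bar plot, assuming penguins is loaded:

```r
ggplot(penguins, aes(y = species)) +
  geom_bar()
```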

-2, How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

ggplot(penguins, aes(x = species)) + geom_bar(color = "red")

ggplot(penguins, aes(x = species)) + geom_bar(fill = "red")

#The two plots differ in which part of the bars is colored: color changes only the bar outlines, while fill colors the bars themselves. fill is therefore more useful for changing the color of bars.

-3, What does the bins argument in geom_histogram() do?

#The bins argument sets the number of bins ("buckets") the data are divided into; the default is 30.

-4, Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?

ggplot(diamonds, aes(x = carat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 15)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 1.5)

#A large binwidth such as 1.5 collapses the data into only a few bars; a smaller binwidth reveals more interesting patterns, such as the spikes at round carat values.

2.5.5 Exercise

-1, The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
#Categorical variables (stored as <chr>): manufacturer, model, trans, drv, fl, and class.
#Numerical variables: displ (<dbl>), and year, cyl, cty, and hwy (<int>).
#Running str(mpg) (or glimpse(mpg)) displays each variable's name and type; when you print mpg, the type abbreviations appear under the column names.

-2, Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = cyl, size = cyl)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()

#When a numerical variable is mapped to color, the points take on a continuous color gradient (higher values are lighter by default); mapped to size, higher values get larger points; mapped to both, the two encodings are redundant.
#Mapping a numerical variable to shape produces an error, because shapes are discrete; that is why drv (a categorical variable) is used for shape above.
#When a categorical variable is mapped to color or shape, each category gets its own distinct color or shape, which makes the groups easy to tell apart.

-3, In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

#Nothing visible happens: linewidth controls the width of lines, so it has no effect on geom_point(); it would matter for a line geom such as geom_smooth().

-4, What happens if you map the same variable to multiple aesthetics?

#The variable is encoded redundantly: for example, mapping species to both color and shape draws each species with its own color-and-shape combination, and ggplot2 merges the matching scales into a single legend.
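A minimal sketch of such a redundant mapping, assuming penguins is loaded, with species driving both color and shape:

```r
ggplot(penguins, aes(
  x = bill_length_mm, y = bill_depth_mm,
  color = species, shape = species
)) +
  geom_point(na.rm = TRUE)
```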

-5, Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

ggplot(data = penguins,
 mapping = aes(x = bill_length_mm , y = bill_depth_mm))+
  geom_point(aes(color = species))

#Each species forms a distinct cluster, and within each cluster bill depth increases with bill length, a positive relationship that the uncolored plot hides. Faceting by species (facet_wrap(~species)) separates the clusters into their own panels, making the within-species trend even easier to see.

-6,Why does the following yield two separate legends? How would you fix it to combine the two legends?

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species")

#This is because labs(color = "Species") renames only the color legend; the shape legend keeps the default name species, and ggplot2 only merges legends whose titles and keys match.
#FIX by giving the color and shape scales the same name:

ggplot(data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species )) +
  geom_point() +
  scale_color_discrete(name = "Species") +
  scale_shape_discrete(name = "Species")

-7, Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

#First one: what proportion of the penguins on each island belongs to each species?
#Second one: what proportion of each species is found on each island?

2.6.1 Exercise

-1,Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.png")

#The second plot (the scatterplot) is saved, because ggsave() defaults to saving the most recently displayed plot.

-2, What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?

#To save as a PDF, change the file extension at the end of the file name from .png to .pdf: ggsave("mpg-plot.pdf"). The supported file types are listed in the documentation, which you can open with ?ggsave.
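A minimal sketch, assuming ggplot2 is loaded and the plot has just been displayed:

```r
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.pdf")  # the output device is inferred from the .pdf extension
```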

3.5 Exercise

-1, Why does this code not work?

my_variable <- 10
my_varıable
## Error in eval(expr, envir, enclos): object 'my_varıable' not found

Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)

#The second line spells the name with a dotless ı (my_varıable) instead of the Latin i used in my_variable, so R cannot find the object.

-2,Tweak each of the following R commands so that they run correctly:

libary(todyverse)

ggplot(dTA = mpg) + geom_point(maping = aes(x = displ y = hwy)) + geom_smooth(method = “lm)

library(tidyverse)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth(method = "lm")

-3,Press Option + Shift + K / Alt + Shift + K. What happens? How can you get to the same place using the menus?

#The Keyboard Shortcut Quick Reference pops up. From the menus, go to Tools -> Keyboard Shortcuts Help.

-4, Let’s revisit an exercise from the Section 2.6. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

my_bar_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar()

my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave(filename = "mpg-plot.png", plot = my_bar_plot)

#The first one (the bar plot), since the plot argument of ggsave() explicitly selects my_bar_plot instead of the last plot displayed.

4.2.5 Rows Exercise

library(nycflights13)

-1, In a single pipeline for each condition, find all flights that meet the condition:

#Had an arrival delay of two or more hours
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # ℹ 10,190 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ℹ 9,303 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#Were operated by United, American, or Delta
filter(flights, carrier %in% c("AA", "DL", "UA"))
## # A tibble: 139,504 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # ℹ 139,494 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#Departed in summer (July, August, and September)
filter(flights, month >= 7, month <= 9)
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ℹ 86,316 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#Arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 120, dep_delay <= 0)
## # A tibble: 29 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1350           1350         0     1736           1526
##  3  2013    10     7     1357           1359        -2     1858           1654
##  4  2013    10    16      657            700        -3     1258           1056
##  5  2013    11     1      658            700        -2     1329           1015
##  6  2013     3    18     1844           1847        -3       39           2219
##  7  2013     4    17     1635           1640        -5     2049           1845
##  8  2013     4    18      558            600        -2     1149            850
##  9  2013     4    18      655            700        -5     1213            950
## 10  2013     5    22     1827           1830        -3     2217           2010
## # ℹ 19 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
## # A tibble: 1,844 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     2205           1720       285       46           2040
##  2  2013     1     1     2326           2130       116      131             18
##  3  2013     1     3     1503           1221       162     1803           1555
##  4  2013     1     3     1839           1700        99     2056           1950
##  5  2013     1     3     1850           1745        65     2148           2120
##  6  2013     1     3     1941           1759       102     2246           2139
##  7  2013     1     3     1950           1845        65     2228           2227
##  8  2013     1     3     2015           1915        60     2135           2111
##  9  2013     1     3     2257           2000       177       45           2224
## 10  2013     1     4     1917           1700       137     2135           1950
## # ℹ 1,834 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

-2, Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#The most delayed flight was HA 51 (JFK to HNL), delayed 1,301 minutes; the earliest departure was B6 97 (JFK to DEN), which left 43 minutes early.

-3, Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

head(arrange(flights, desc(distance / air_time)))
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     5    25     1709           1700         9     1923           1937
## 2  2013     7     2     1558           1513        45     1745           1719
## 3  2013     5    13     2040           2025        15     2225           2226
## 4  2013     3    23     1914           1910         4     2045           2043
## 5  2013     1    12     1559           1600        -1     1849           1917
## 6  2013    11    17      650            655        -5     1059           1150
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#DL1499 is the fastest flight in terms of speed.

-4,Was there a flight on every day of 2013?

flights %>% 
  filter(year == 2013) %>% 
  distinct(month, day)
## # A tibble: 365 × 2
##    month   day
##    <int> <int>
##  1     1     1
##  2     1     2
##  3     1     3
##  4     1     4
##  5     1     5
##  6     1     6
##  7     1     7
##  8     1     8
##  9     1     9
## 10     1    10
## # ℹ 355 more rows
#Yes: distinct(month, day) returns 365 rows, one for each day of 2013, so every day had at least one flight.
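An alternative one-line check, a sketch assuming dplyr and nycflights13 are loaded:

```r
flights |>
  distinct(month, day) |>
  nrow()  # 365, matching the number of days in 2013
```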

-5,Which flights traveled the farthest distance? Which traveled the least distance?

flights %>% 
  arrange(desc(distance))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      857            900        -3     1516           1530
##  2  2013     1     2      909            900         9     1525           1530
##  3  2013     1     3      914            900        14     1504           1530
##  4  2013     1     4      900            900         0     1516           1530
##  5  2013     1     5      858            900        -2     1519           1530
##  6  2013     1     6     1019            900        79     1558           1530
##  7  2013     1     7     1042            900       102     1620           1530
##  8  2013     1     8      901            900         1     1504           1530
##  9  2013     1     9      641            900      1301     1242           1530
## 10  2013     1    10      859            900        -1     1449           1530
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% 
  arrange(distance)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7    27       NA            106        NA       NA            245
##  2  2013     1     3     2127           2129        -2     2222           2224
##  3  2013     1     4     1240           1200        40     1333           1306
##  4  2013     1     4     1829           1615       134     1937           1721
##  5  2013     1     4     2128           2129        -1     2218           2224
##  6  2013     1     5     1155           1200        -5     1241           1306
##  7  2013     1     6     2125           2129        -4     2224           2224
##  8  2013     1     7     2124           2129        -5     2212           2224
##  9  2013     1     8     2127           2130        -3     2304           2225
## 10  2013     1     9     2126           2129        -3     2217           2224
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
#HA 51 (JFK to HNL) traveled the farthest distance, and US 1632 (EWR to LGA) traveled the least.

-6,Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

#The final result is the same in either order, but the amount of work differs: filtering first is more efficient, because arrange() then sorts only the remaining rows, whereas arranging first wastes effort sorting rows that filter() will discard.

4.3.5 Exercise

-1, Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

#I would expect dep_delay = dep_time - sched_dep_time, in minutes. In practice the simple subtraction only works within the same hour, because the times are stored as HHMM integers (e.g. 530 means 5:30 a.m.), so they must be converted to minutes first.
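A minimal sketch of the check, converting the HHMM-encoded times to minutes since midnight first, assuming dplyr and nycflights13 are loaded; the *_min column names are made up for illustration:

```r
library(dplyr)
library(nycflights13)

flights |>
  mutate(
    # split HHMM into hours (%/% 100) and minutes (%% 100)
    dep_time_min       = (dep_time %/% 100) * 60 + dep_time %% 100,
    sched_dep_time_min = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100,
    dep_delay_check    = dep_time_min - sched_dep_time_min
  ) |>
  select(dep_time, sched_dep_time, dep_delay, dep_delay_check)
```

Rows that cross midnight still disagree, since dep_time wraps around to small values while the delay keeps counting.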

-2, Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights %>% 
  select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows
flights %>% 
  select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows
flights %>% 
  select(c(dep_time, dep_delay, arr_time, arr_delay))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows

-3, What happens if you specify the name of the same variable multiple times in a select() call?

flights %>% 
  select(dep_time, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows
#Nothing different happens: the duplicated name is ignored, and dep_time appears only once in the result.

-4, What does the any_of() function do? Why might it be helpful in conjunction with this vector?

#any_of() selects the columns whose names appear in a character vector and, unlike all_of(), does not error when a name is missing from the data frame. That makes it helpful with this vector: the selection still works even if some of the listed variables are absent, as follows.
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% 
  select(any_of(variables))
## # A tibble: 336,776 × 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ℹ 336,766 more rows
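A minimal sketch of why any_of() is forgiving (the misspelled name "arrdelay" below is a made-up example): unlike all_of(), it silently skips names that don't exist.

```r
library(dplyr)
library(nycflights13)

vars <- c("year", "month", "arrdelay")  # "arrdelay" is not a column in flights

flights |> select(any_of(vars))   # selects year and month, skips the bad name
# flights |> select(all_of(vars)) # would error: column `arrdelay` doesn't exist
```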

-5, Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

flights |> select(contains("TIME"))

#By default the select helpers ignore case (ignore.case = TRUE), so every column whose name contains "time" is returned. To make the match case-sensitive:
select(flights, contains("TIME",  ignore.case = FALSE))
## # A tibble: 336,776 × 0

-6, Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

flights |> 
  rename(air_time_min = air_time)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time_min <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> 
  relocate(air_time)
## # A tibble: 336,776 × 19
##    air_time  year month   day dep_time sched_dep_time dep_delay arr_time
##       <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1      227  2013     1     1      517            515         2      830
##  2      227  2013     1     1      533            529         4      850
##  3      160  2013     1     1      542            540         2      923
##  4      183  2013     1     1      544            545        -1     1004
##  5      116  2013     1     1      554            600        -6      812
##  6      150  2013     1     1      554            558        -4      740
##  7      158  2013     1     1      555            600        -5      913
##  8       53  2013     1     1      557            600        -3      709
##  9      140  2013     1     1      557            600        -3      838
## 10      138  2013     1     1      558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
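The exercise asks for both steps at once; they can be chained into a single pipeline:

```r
library(dplyr)
library(nycflights13)

flights |>
  rename(air_time_min = air_time) |>
  relocate(air_time_min)
```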

-7, Why doesn’t the following work, and what does the error mean?

flights |> select(tailnum) |> arrange(arr_delay)
## Error in `arrange()`:
## ℹ In argument: `..1 = arr_delay`.
## Caused by error:
## ! object 'arr_delay' not found

flights |> 
  select(tailnum, arr_delay) |> 
  arrange(arr_delay)
## # A tibble: 336,776 × 2
##    tailnum arr_delay
##    <chr>       <dbl>
##  1 N843VA        -86
##  2 N840VA        -79
##  3 N851UA        -75
##  4 N3KCAA        -75
##  5 N551AS        -74
##  6 N24212        -73
##  7 N3760C        -71
##  8 N806UA        -71
##  9 N805JB        -71
## 10 N855VA        -70
## # ℹ 336,766 more rows
#The select() call keeps only tailnum, so arr_delay no longer exists in the data frame when arrange() looks for it. Including arr_delay in the select(), as above, fixes the error.

4.5.7 Exercise.

1, Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))
## # A tibble: 16 × 2
##    carrier arr_delay
##    <chr>       <dbl>
##  1 F9         21.9  
##  2 FL         20.1  
##  3 EV         15.8  
##  4 YV         15.6  
##  5 OO         11.9  
##  6 MQ         10.8  
##  7 WN          9.65 
##  8 B6          9.46 
##  9 9E          7.38 
## 10 UA          3.56 
## 11 US          2.13 
## 12 VX          1.76 
## 13 DL          1.64 
## 14 AA          0.364
## 15 HA         -6.92 
## 16 AS         -9.93
#F9 (Frontier Airlines) has the worst average arrival delays. Disentangling bad airports from bad carriers is difficult because carriers fly different mixes of routes; grouping by both carrier and dest helps, but many carrier-destination pairs have too few flights to compare reliably.

-2, Find the flights that are most delayed upon departure from each destination.

-3, How do delays vary over the course of the day. Illustrate your answer with a plot.

-4, What happens if you supply a negative n to slice_min() and friends?

-5, Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

-6, Suppose we have the following tiny data frame:

df <- tibble(x = 1:5, y = c("a", "b", "a", "a", "b"), z = c("K", "K", "L", "L", "K"))

Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

df |> group_by(y)

Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?

df |> arrange(y)

Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

df |> group_by(y) |> summarize(mean_x = mean(x))

Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

df |> group_by(y, z) |> summarize(mean_x = mean(x))

Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d).

df |> group_by(y, z) |> summarize(mean_x = mean(x), .groups = "drop")

Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

df |> group_by(y, z) |> summarize(mean_x = mean(x))

df |> group_by(y, z) |> mutate(mean_x = mean(x))

5.6 Exercise

-1,Restyle the following pipelines following the guidelines above.

flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(), delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)

flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time> 0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean( arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)

#Repaired
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  filter(n > 10)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day     n delay
##    <int> <int> <int> <int> <dbl>
##  1  2013     1     1    20 17.8 
##  2  2013     1     2    20  7   
##  3  2013     1     3    19 18.3 
##  4  2013     1     4    20 -3.2 
##  5  2013     1     5    13 20.2 
##  6  2013     1     6    18  9.28
##  7  2013     1     7    19 -7.74
##  8  2013     1     8    19  7.79
##  9  2013     1     9    19 18.1 
## 10  2013     1    10    19  6.68
## # ℹ 355 more rows
flights |>
  filter(
    carrier == "UA",
    dest %in% c("IAH", "HOU"),
    sched_dep_time > 0900,
    sched_arr_time < 2000
  ) |>
  group_by(flight) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    cancelled = sum(is.na(arr_delay)),
    n = n()
  ) |>
  filter(n > 10)
## # A tibble: 74 × 4
##    flight delay cancelled     n
##     <int> <dbl>     <int> <int>
##  1     53 12.5          2    18
##  2    112 14.1          0    14
##  3    205 -1.71         0    14
##  4    235 -5.36         0    14
##  5    255 -9.47         0    15
##  6    268 38.6          1    15
##  7    292  6.57         0    21
##  8    318 10.7          1    20
##  9    337 20.1          2    21
## 10    370 17.5          0    11
## # ℹ 64 more rows

6.2.1

-1, For each of the sample tables, describe what each observation and each column represents.

#Table 1 has columns country, year, cases, and population; each row is one observation (a country-year), and each cell holds the value of one variable for that observation.
#Table 2 has columns country, year, type, and count. type is a character variable indicating whether count refers to cases or population.
#Table 3 has columns country, year, and rate, where rate is a character string combining the cases and population values from table 1.
#Table 4 is split across two tables: the first gives cases with one column per year, while the second gives population in the same layout.

-2, Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:

Extract the number of TB cases per country per year. Extract the matching population per country per year. Divide cases by population, and multiply by 10000. Store back in the appropriate place. You haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.

#Extract the number of TB cases per country per year.
table2_cases <- table2 %>%
  filter(type == "cases")

#Extract the matching population per country per year
table2_pop <- table2 %>%
    filter(type == "population")

#Divide cases by population, and multiply by 10000 
table2_com <- tibble(
  country = table2_cases$country,
  year = table2_cases$year,
  cases = table2_cases$count,
  population = table2_pop$count
  )
#Store back in the appropriate place.
table2_com <- table2_com %>%
    mutate(rate = (cases / population) * 10000)

-Notes for new verbs: pivot_longer() turns columns into rows; parse_number() extracts numbers from strings; names_sep = splits column names on a separator character; distinct() drops duplicate rows; pivot_wider() turns rows into columns; names_from = "some column" says where the new column names come from; id_cols = starts_with("...") selects the columns that identify each row.

-Difference between tibble() and tribble()?
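To answer that note: tibble() builds a data frame column by column, while tribble() ("transposed tibble") lets you write it row by row, with column names given as ~name headers. A minimal sketch:

```r
library(tibble)

# Column-wise construction
tibble(x = c(1, 2), y = c("a", "b"))

# Row-wise construction of the same data frame
tribble(
  ~x, ~y,
   1, "a",
   2, "b"
)
```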

7.3 Exercise.

-1, Go to the RStudio Tips Twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!

-2, What other common mistakes will RStudio diagnostics report? Read https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics to find out.

8.2.4

-What function would you use to read a file where fields were separated with “|”?

#We could use read_delim() with delim = "|" to read a file whose fields are separated by "|".

-Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

#Common: col_names, col_types, col_select, id, locale, na, trim_ws, quoted_na, quote, comment,  skip, n_max, guess_max, progress, name_repair, num_threads, show_col_types, skip_empty_rows, lazy

-What are the most important arguments to read_fwf()?

#I would say fwf_widths() and fwf_positions() are the most important, as they define the field positions used to extract the data (they build the col_positions argument).

-Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?

"x,y\n1,'a,b'"

#We need to specify quote = "'", as follows:
read_csv("x,y\n1,'a,b'", quote = "'")
## Rows: 1 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): y
## dbl (1): x
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a,b

-Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")

read_csv("a,b\n1,2,3\n4,5,6")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): a
## num (1): b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
##       a     b
##   <dbl> <dbl>
## 1     1    23
## 2     4    56
#Parsing issue: the header declares two columns, but each data row contains three values, so the rows do not match the column specification.

read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## num (1): c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2    NA
## 2     1     2    34
#There are three header columns in the data frame: the first data row has only two values (so c becomes NA), while the second has four, which does not match the number of columns.

read_csv("a,b\n\"1")
## Rows: 0 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 0 × 2
## # ℹ 2 variables: a <chr>, b <chr>
#The dataset includes two header columns, but the first data row has a single value with an unclosed quote, so no complete row can be parsed (0 rows).

read_csv("a,b\n1,2\na,b")
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
##   a     b    
##   <chr> <chr>
## 1 1     2    
## 2 a     b
read_csv("a;b\n1;3")
## Rows: 1 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): a;b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 1
##   `a;b`
##   <chr>
## 1 1;3
#For a dataset that uses ";" as the delimiter, use read_csv2() instead.
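A quick sketch of the fix: read_csv2() expects ";" as the field separator (and "," as the decimal mark), so it parses the same string into two columns:

```r
library(readr)

read_csv2("a;b\n1;3")  # two columns: a = 1, b = 3
```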

-Practice referring to non-syntactic names in the following data frame by:

Extracting the variable called 1. Plotting a scatterplot of 1 vs. 2. Creating a new column called 3, which is 2 divided by 1. Renaming the columns to one, two, and three.

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

#Extract the variables (non-syntactic names need backticks)
annoying %>%
  select(`1`, `2`)
## # A tibble: 10 × 2
##      `1`   `2`
##    <int> <dbl>
##  1     1  4.43
##  2     2  2.97
##  3     3  7.25
##  4     4  9.31
##  5     5 11.5 
##  6     6 10.6 
##  7     7 13.5 
##  8     8 15.3 
##  9     9 20.1 
## 10    10 20.0
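The remaining parts of the exercise can be sketched as follows (the exact values of `2` will differ because of rnorm()):

```r
library(tidyverse)

# Plot a scatterplot of 1 vs. 2
ggplot(annoying, aes(x = `1`, y = `2`)) +
  geom_point()

# Create a new column called 3, which is 2 divided by 1
annoying <- annoying |>
  mutate(`3` = `2` / `1`)

# Rename the columns to one, two, and three
annoying |>
  rename(one = `1`, two = `2`, three = `3`)
```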

9 Workflow: getting help

9.1 Google is your friend

  • It is helpful to use Google and Stack Overflow to solve code errors.

9.2 Making a reprex

  • Make code reproducible with reprex(), which formats the output for GitHub by default. Use dput() to generate the R code needed to recreate a data object.

9.3 Investing in yourself

1, Reading the tidyverse blog (https://www.tidyverse.org/blog/). 2, Reading R Weekly (https://rweekly.org/).

10 Layers

10.1 Introduction

  • library(tidyverse)

10.2 Aesthetic mappings

  • ggplot2 will only use six shapes at a time when mapping a discrete variable to the shape aesthetic; additional groups go unplotted.

  • Mapping an unordered discrete (categorical) variable (class) to an ordered aesthetic (size or alpha) is generally not a good idea because it implies a ranking that does not in fact exist.

  • You can customize size, shape, and color by passing different numbers (for size and shape) and color names (for color).

10.2.1 Exercises

1, Create a scatterplot of hwy vs. displ where the points are pink filled in triangles.

library(tidyverse)
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(color = "pink", shape = 17)

2. Why did the following code not result in a plot with blue points?

  • A: color = "blue" was placed inside aes(), so it was treated as a variable mapping rather than a fixed color; it should be passed to geom_point() outside of aes().
  3. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
  • A: stroke modifies the width of a point's border, which lets you color the inside and outside differently. It works with shapes that have a border (shapes 21-25).
  4. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note, you'll also need to specify x and y.
ggplot(mpg, aes(x = hwy, y = displ, color = displ < 5)) + geom_point()

  • A: It creates a temporary logical variable and, by default, colors the points with displ < 5 differently from those with displ >= 5.

10.3 Geometric objects

-geom_smooth() fits a smoothed line to the data, and can draw separate lines when a grouping aesthetic is mapped to a categorical variable. Other geoms include geom_histogram(), geom_density(), geom_boxplot(), and more (https://ggplot2.tidyverse.org/reference).

10.3.1 Exercises

  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
  • A: geom_line() for a line chart, geom_boxplot() for a boxplot, geom_histogram() for a histogram, and geom_area() for an area chart.
  2. Earlier in this chapter we used show.legend without explaining it:

ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(aes(color = drv))

What does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?

  • A: show.legend = FALSE removes the legend that maps each color to a drv value on the right. Removing the argument makes the legend appear again. I think we used it earlier to keep the combined plots compact when the mapping was already clear.
  3. What does the se argument to geom_smooth() do?
  • A: It displays the confidence interval around the smooth line (TRUE by default; see level to control it).
  4. Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it's drv.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(group = drv), se = FALSE) +
  geom_point()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv, linetype = drv)) +
  geom_point() +
  geom_smooth( se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point(size = 4, color = "white") +
  geom_point()

10.4 Facets

  • facet_wrap() splits a plot into subplots. facet_grid() takes a double-sided formula: rows ~ cols. The scales argument of facet_grid() lets panels use free axis scales.

10.4.1 Exercises

1 What happens if you facet on a continuous variable? - The variable is implicitly converted to a factor, so you get one subplot per unique value, which can produce an unreadable number of panels.

  2. What do the empty cells in the plot above with facet_grid(drv ~ cyl) mean? Run the following code. How do they relate to the resulting plot?
ggplot(mpg) + 
  geom_point(aes(x = drv, y = cyl))

- Empty cells in the faceted plot stand for drv/cyl combinations with no observations in the data set. The code in the question plots drv against cyl directly; the combinations missing from that scatterplot are exactly the empty facets in the previous plot.

3.What plots does the following code make? What does . do?

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

- The . suppresses faceting in that dimension of facet_grid(). The first plot facets by drv in rows (one panel per drv value, stacked vertically), while the second facets by cyl in columns.

4.Take the first faceted plot in this section:

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset? - Advantage: each category gets its own panel, so crowded or overlapping groups stay readable. Disadvantage: the groups no longer share one panel, so direct comparison is harder. With a larger dataset, overplotting worsens and faceting becomes more attractive.

5.Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn't facet_grid() have nrow and ncol arguments? - nrow and ncol determine the number of rows and columns in facet_wrap(); options such as dir and as.table also control the panel layout. facet_grid() does not have them because its numbers of rows and columns are fixed by the unique values of the faceting variables.

  6. Which of the following plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?
ggplot(mpg, aes(x = displ)) + 
  geom_histogram() + 
  facet_grid(drv ~ .)

ggplot(mpg, aes(x = displ)) + 
  geom_histogram() +
  facet_grid(. ~ drv)

Recreate the following plot using facet_wrap() instead of facet_grid(). How do the positions of the facet labels change?

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~drv)

  • The stacked layout (facet_grid(drv ~ .)) keeps a shared displ axis across the panels, which makes it easier to compare engine size across drive trains; put the faceting variable in rows when comparing distributions along the x-axis.
  • With facet_wrap(), the facet labels move from the right-hand side to the top of each panel.

10.5 Statistical transformations

  • geom_bar(stat = "identity") maps the height of the bars to the raw values of a y variable. stat_summary() can summarize the y values for each unique x value. And always use ?stat_bin for help.

10.5.1 Exercises

1.What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median)

- A: The default geom associated with stat_summary() is geom_pointrange()

2.What does geom_col() do? How is it different from geom_bar()? - A: geom_col() draws bars whose heights represent values already in the data (default stat_identity), while geom_bar() counts the rows for each x value (default stat_count).

3.Most geoms and stats come in pairs that are almost always used in concert. Make a list of all the pairs. What do they have in common? (Hint: Read through the documentation.) - A: Most geom_*/stat_* pairs share a base name (e.g. geom_smooth()/stat_smooth(), geom_histogram()/stat_bin()), and many simple geoms use stat_identity as the default stat.

4.What variables does stat_smooth() compute? What arguments control its behavior? - A: It computes y (the predicted value), ymin and ymax (the confidence bounds), and se (the standard error). The arguments method, formula, se, span, and method.args control its behavior.

5.In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

ggplot(diamonds, aes(x = cut, y = after_stat(prop))) + 
  geom_bar()

ggplot(diamonds, aes(x = cut, y = after_stat(prop), group=1)) + 
  geom_bar()

ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + 
  geom_bar()

ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop), group=1)) +
  geom_bar()

- Without group = 1, the proportions are computed within each x group, so every bar has height 1 and the plot does not show each group's share of the dataset. When fill is also mapped, group = 1 alone is not enough; the grouping structure must be handled so proportions are computed across the right groups.

10.6 Position adjustments

  • With a set x, whatever variable is put in fill returns the combination of that variable with the set x.
  • Three other options: "identity", "dodge", and "fill".
  • position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them.
  • position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
  • position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
  • position = "jitter" adds a small amount of random noise to each point.
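The bar-chart adjustments above can be contrasted on one plot; a minimal sketch:

```r
library(tidyverse)

base <- ggplot(mpg, aes(x = drv, fill = class))

base + geom_bar()                    # default: stacked bars
base + geom_bar(position = "fill")   # stacked, each bar normalized to height 1
base + geom_bar(position = "dodge")  # sub-bars placed side by side
```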

10.6.1

1.What is the problem with the following plot? How could you improve it?

ggplot(mpg, aes(x = cty, y = hwy)) + 
  geom_point()

ggplot(mpg, aes(x = cty, y = hwy)) + 
  geom_point(position = "jitter")

  • Many cty/hwy combinations overlap exactly, so points are plotted on top of each other (overplotting). We can add position = "jitter" to add a small amount of random noise to each point and reveal the hidden data.

2.What, if anything, is the difference between the two plots? Why?

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "identity")

- There is no graphical difference between the two plots, since position = "identity" places each object exactly where it falls in the graph, which is already the default behavior of geom_point().

  1. What parameters to geom_jitter() control the amount of jittering? -The width and height arguments.

  2. Compare and contrast geom_jitter() with geom_count(). -geom_count() does not move points; it counts the overlapping observations at each location and maps the count to point size, while geom_jitter() adds small random offsets to each point.

  3. What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

ggplot(data = mpg, aes(x = hwy, y = displ, colour = class)) +
  geom_boxplot()

- The default position adjustment is "dodge2" (position_dodge2()); leaving position unset, as above, demonstrates it: the boxplots for each class are dodged side by side.

10.7 Coordinate systems

  • coord_quickmap() sets the aspect ratio correctly for geographic maps. coord_polar() uses polar coordinates.

10.7.1 Exercises

1.Turn a stacked bar chart into a pie chart using coord_polar().

ggplot(mpg, aes(x = factor(1), fill = class)) +
  geom_bar()

ggplot(mpg, aes(x = factor(1), fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

2.What's the difference between coord_quickmap() and coord_map()? (?coord_quickmap, ?coord_map) - coord_quickmap() uses a faster but more approximate projection than coord_map().

3.What does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

  • The plot shows that hwy is consistently higher than cty. coord_fixed() keeps a 1:1 aspect ratio so the slope of the relationship is not visually distorted, and geom_abline() adds the reference line y = x, making it easy to compare highway and city mileage.

10.8 The layered grammar of graphics

  • Select the necessary data from raw data to create the graph you needed.

11 Exploratory data analysis

11.1 Introduction

  • Ask questions about the data; search for answers by visualizing, transforming, and modelling the data; use what you learn to refine your questions and/or generate new questions.

11.2 Questions

  • Using visualization and grouping. How are the observations within each subgroup similar to each other? How are the observations in separate clusters different from each other? How can you explain or describe the clusters? Why might the appearance of clusters be misleading?

  • coord_cartesian() can be used to zoom in on unusual values

11.3.3 Exercises

1.Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

#Distribution of x
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01)

#Distribution of y
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01)

#Distribution of z
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01)

- All three variables are right skewed with some noticeable outliers (including impossible values of 0). According to ?diamonds, x is the length, y the width, and z the depth; x and y span a similar range, while z is smaller, consistent with depth being the smallest dimension.

2.Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 10, center = 0)

- The price distribution is right skewed (a long tail of expensive diamonds), and most diamonds are relatively cheap. With a small binwidth, there also appears to be a surprising gap with no diamonds priced around $1,500.

3.How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

#Diamonds with 0.99 and 1 carat
diamonds %>%
  filter(carat >= 0.99, carat <= 1) %>%
  count(carat)
## # A tibble: 2 × 2
##   carat     n
##   <dbl> <int>
## 1  0.99    23
## 2  1     1558
  • There are 1558 1-carat diamonds but only 23 0.99-carat diamonds. The likely cause is rounding: a diamond just under 1 carat sells for noticeably less, so cutters and sellers round up to the 1-carat threshold.

4.Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

#Leave bin width unset
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(114, 514), ylim = c(1000, 4000))

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(114, 514) +
  ylim(1000, 4000)

- coord_cartesian() zooms into a region of the full plot after the statistics are computed, while xlim() and ylim() drop the data outside the range before computing them; as seen above, xlim()/ylim() cut off the out-of-range observations entirely. Leaving binwidth unset falls back to the default of 30 bins (with a message). When zooming so only half a bar shows, coord_cartesian() displays the partial bar, whereas xlim()/ylim() drop that bar's data.

11.4 Unusual Values & Exercises

  • Drop rows with strange values.

1.What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?

  • (Google)The missing values in a histogram would be removed. “In the geom_bar() function, NA is treated as another category. The x aesthetic in geom_bar() requires a discrete (categorical) variable, and missing values act like another category. In a histogram, the x aesthetic variable needs to be numeric, and stat_bin() groups the observations by ranges into bins. Since the numeric value of the NA observations is unknown, they cannot be placed in a particular bin, and are dropped.”

2.What does na.rm = TRUE do in mean() and sum()?

  • It removes NA values before calculating the mean and sum.

3.Recreate the frequency plot of scheduled_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.

nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |> 
  ggplot(aes(x = sched_dep_time)) + 
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4)
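
The exercise also asks for faceting; one sketch, building on the plot above, uses scales = "free_y" so the much larger non-cancelled group does not flatten the cancelled facet:

```r
library(tidyverse)
library(nycflights13)

flights |>
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |>
  ggplot(aes(x = sched_dep_time)) +
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4) +
  # each facet gets its own y-axis range
  facet_wrap(~cancelled, scales = "free_y")
```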

11.5 Covariation

  • Covariation is the tendency for the values of two or more variables to vary together in a related way.

11.5.1 A categorical and a numerical variable

  • fct_reorder() reorders the levels of a factor for a more informative display.
ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +
  geom_boxplot()

11.5.1.1 Exercises

1.Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.

nycflights13::flights |> 
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) |> 
  ggplot() +
  geom_boxplot(aes(x = cancelled, y = dep_time))

2.Based on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

#Correlation between carat and price
ggplot(diamonds) +
  geom_point(aes(x = carat, y = price), color = "green", alpha = 0.76)

#Correlation between depth and price
ggplot(diamonds) +
  geom_point(aes(x = depth, y = price), color = "green", alpha = 0.76)

#Correlation between z and price
ggplot(diamonds) +
  geom_point(aes(x = z, y = price), color = "green", alpha = 0.76)

#Correlation between x and price
ggplot(diamonds) +
  geom_point(aes(x = x, y = price), color = "green", alpha = 0.76)

- Carat appears most important for predicting price; x, y, and z largely measure the same thing as carat (overall size), while depth shows little relationship. Lower-quality (Fair) cut diamonds tend to be larger, and since carat dominates price, lower-quality diamonds can end up more expensive.

3.Instead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?
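
No answer was recorded here; a minimal sketch of the comparison:

```r
library(tidyverse)

# Option 1: exchange the variables directly
ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot()

# Option 2: keep the vertical mapping and flip the coordinates
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```

Both produce horizontal boxplots; coord_flip() leaves the aesthetic mapping unchanged and swaps the axes at drawing time.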

4.One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?

5.Create a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?

6.If you have a small dataset, it’s sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
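
The two main geoms the package provides are geom_quasirandom() and geom_beeswarm(); a sketch (requires the ggbeeswarm package):

```r
library(tidyverse)
library(ggbeeswarm)

# geom_quasirandom(): jitter-like horizontal offsets whose width follows
# the local density, so the point cloud resembles a violin plot
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_quasirandom()

# geom_beeswarm(): packs points side by side deterministically so that
# none of them overlap
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_beeswarm()
```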

11.5.2 Two categorical variables

  • To plot two categorical variables, use geom_count().
  • Alternatively, use dplyr to compute the counts between the variables first.
  • Then visualize with geom_tile() and the fill aesthetic.
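
The bullets above, sketched with the diamonds data:

```r
library(tidyverse)

# geom_count() sizes each point by how often the combination occurs
ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

# or compute the counts with dplyr first, then map them to fill
diamonds |>
  count(color, cut) |>
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))
```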

11.5.3 Two numerical variables

  • Use geom_point() to plot two numerical variables; the alpha aesthetic adds transparency when points overplot.
  • New tools that bin in two dimensions: geom_bin2d() and geom_hex().

11.6 Patterns and models

  • If a systematic relationship exists between two variables, it will appear as a pattern in the data. When you spot a pattern, ask: Could it be due to coincidence? What relationship does the pattern imply? How strong is the relationship? What other variables might affect it? Does the relationship change if you look at individual subgroups of the data?

12 Communication

12.1.1 Prerequisites

library(tidyverse)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggrepel)
library(patchwork)

12.2 Labels

  • labs() adds labels to plot elements, including x, y, color, title, subtitle, and caption.

12.2.1 Exercises

1.Create one plot on the fuel economy data with customized title, subtitle, caption, x, y, and color labels.

  ggplot()+
  geom_point(data = mpg, aes(x = displ, y = hwy, colour = drv, shape = drv))+
  labs( x = "Engine displacement (L)",
        y = "Highway fuel economy (mpg)",
        title = "Large engine displacement results in lower gas mileage performance",
        subtitle = "SUV and pickup classes have more small engine & high mpg combination",
        caption = "Data: mpg dataset (fueleconomy.gov) shipped with ggplot2")

2.Recreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.

  ggplot(mpg, aes(x = cty, y= hwy, shape = drv, color = drv))+
  geom_point()+
  labs( x = "City MPG",
        y = "Highway MPG",
        shape = "Type of drive train")

3.Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.

12.3 Annotations

12.3.1 Exercises

1.Use geom_text() with infinite positions to place text at the four corners of the plot.

label <- tribble(
  ~displ, ~hwy, ~label, ~vjust, ~hjust,
  Inf, Inf, "Top right", "top", "right",
  Inf, -Inf, "Bottom right", "bottom", "right",
  -Inf, Inf, "Top left", "top", "left",
  -Inf, -Inf, "Bottom left", "bottom", "left"
)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(aes(label = label, vjust = vjust, hjust = hjust), data = label)

  2. Use annotate() to add a point geom in the middle of your last plot without having to create a tibble. Customize the shape, size, or color of the point.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
   annotate(geom = "point",
    x = mean(range(mpg$displ)), y = mean(range(mpg$hwy)),
    shape = 17, size = 4, color = "blue"
  ) +
   annotate(geom = "label", x = max(mpg$displ), y = max(mpg$hwy),
    label = "Top right", vjust = "top",
    hjust = "right", color = "red"
  ) +
   annotate(geom = "label", x = min(mpg$displ), y = max(mpg$hwy),
    label = "Top left", vjust = "top",
    hjust = "left", color = "red"
  ) +
   annotate(geom = "label", x = max(mpg$displ), y = min(mpg$hwy),
    label = "Bottom right", vjust = "bottom",
    hjust = "right", color = "red"
  ) +
   annotate(geom = "label", x = min(mpg$displ), y = min(mpg$hwy),
    label = "Bottom left", vjust = "bottom",
    hjust = "left", color = "red"
  ) 

  3. How do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the dataset that is being passed to geom_text().)
# labels in each different plots
label <- tibble(
  displ = Inf,
  hwy = Inf,
  class = unique(mpg$class),
  label = str_c("Label for ", class)
)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(aes(label = label),
    data = label, vjust = "top", hjust = "right",
    size = 3
  ) +
  facet_wrap(~class)

  4. What arguments to geom_label() control the appearance of the background box?

-label.padding: padding around the label
-label.r: amount of rounding in the corners
-label.size: width of the label border

  5. What are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options.

-angle: angle of the arrow head
-length: length of the arrow head
-ends: which end(s) of the line to draw an arrow head on
-type: “open” or “closed”: whether the arrow head is an open or closed triangle

12.4 Scales

12.4.1 Default scales

  • scale_ followed by the name of the aesthetic, then _, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. scale_x_continuous() puts the numeric values from displ on a continuous number line on the x-axis, scale_color_discrete() chooses colors for each class of car, etc.

12.4.2 Axis ticks and legend keys

-There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: breaks and labels. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key.

-We can use labels in the same way. label_dollar() formats numbers as dollar amounts; label_percent() formats them as percentages.
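
For example, breaks positions the ticks and labels formats them (label_dollar() comes from the scales package):

```r
library(tidyverse)
library(scales)

ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  scale_x_continuous(
    breaks = seq(0, 15000, by = 5000), # where the ticks go
    labels = label_dollar()            # "$5,000" instead of "5000"
  )
```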

12.4.3 Legend layout

-To control the overall position of the legend, you need to use a theme() setting. The theme setting legend.position controls where the legend is drawn.

-To control the display of individual legends, use guides() along with guide_legend() or guide_colorbar(). Note that the name of the argument in guides() matches the name of the aesthetic, just like in labs().
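
A sketch combining both settings:

```r
library(tidyverse)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  # overall position: draw the legend below the plot
  theme(legend.position = "bottom") +
  # individual legend: lay the keys out in two rows
  guides(color = guide_legend(nrow = 2))
```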

12.4.4 Replacing a scale

-It’s very useful to plot transformations of your variable. The ColorBrewer scales are documented online at https://colorbrewer2.org/ and made available in R via the RColorBrewer package, by Erich Neuwirth.

-For continuous color, you can use the built-in scale_color_gradient() or scale_fill_gradient(). If you have a diverging scale, you can use scale_color_gradient2().

-Note that all color scales come in two varieties: scale_color_*() and scale_fill_*() for the color and fill aesthetics respectively (the color scales are available in both UK and US spellings).

12.4.5 Zooming

-There are three ways to control the plot limits:

-Adjusting what data are plotted.
-Setting the limits in each scale.
-Setting xlim and ylim in coord_cartesian().

-To zoom in on a region of the plot, it’s generally best to use coord_cartesian().

-Setting the limits on individual scales is generally more useful if you want to expand the limits, e.g., to match scales across different plots.

12.4.6 Exercises

  1. Why doesn’t the following code override the default scale?
df <- tibble(
  x = rnorm(10000),
  y = rnorm(10000)
)

ggplot(df, aes(x, y)) +
  geom_hex() +
  scale_color_gradient(low = "white", high = "red") +
  coord_fixed()

-Because the colors in geom_hex() are set by the fill aesthetic, not the color aesthetic.
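
Switching to the matching fill scale makes the override take effect:

```r
library(tidyverse)

df <- tibble(x = rnorm(10000), y = rnorm(10000))

ggplot(df, aes(x, y)) +
  geom_hex() +
  # geom_hex() colors its hexagons through fill, so set the fill scale
  scale_fill_gradient(low = "white", high = "red") +
  coord_fixed()
```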

  2. What is the first argument to every scale? How does it compare to labs()?

-The first argument to every scale is name, which becomes the axis or legend title. Supplying it is equivalent to setting the same label with labs().

  3. Change the display of the presidential terms by:

a. Combining the two variants that customize colors and x axis breaks.
b. Improving the display of the y axis.
c. Labelling each term with the name of the president.
d. Adding informative plot labels.
e. Placing breaks every 4 years (this is trickier than it seems!).

fouryears <- lubridate::make_date(seq(year(min(presidential$start)),
  year(max(presidential$end)),
  by = 4
), 1, 1)

presidential %>%
  mutate(
    id = 33 + row_number(),
    name_id = fct_inorder(str_c(name, " (", id, ")"))
  ) %>%
  ggplot(aes(start, name_id, colour = party)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = name_id)) +
  scale_colour_manual("Party", values = c(Republican = "red", Democratic = "blue")) +
  scale_y_discrete(NULL) +
  scale_x_date(NULL,
    breaks = presidential$start, date_labels = "'%y",
    minor_breaks = fouryears
  ) +
  ggtitle("Terms of US Presidents",
    subtitle = "Eisenhower (34th) to Obama (44th)"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    axis.ticks.y = element_blank()
  )

4.First, create the following plot. Then, modify the code using override.aes to make the legend easier to see.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 1/20)+
  theme(legend.position = "bottom")+
  guides(color = guide_legend(nrow=2, override.aes = list(alpha = 1)))

12.5 Themes

12.5.1 Exercises

  1. Pick a theme offered by the ggthemes package and apply it to the last plot you made.
library(ggthemes)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 1/20)+
  theme_economist()+  # theme from the ggthemes package
  theme(legend.position = c(0.6, 0.7),
    legend.direction = "horizontal",
    legend.box.background = element_rect(color = "black"),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 0))+
  guides(color = guide_legend(nrow=2, override.aes = list(alpha = 1)))

12.6 Layout

12.6.1 Exercises

  1. What happens if you omit the parentheses in the following plot layout. Can you explain why this happens?
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  labs(title = "Plot 1")
p2 <- ggplot(mpg, aes(x = drv, y = hwy)) + 
  geom_boxplot() + 
  labs(title = "Plot 2")
p3 <- ggplot(mpg, aes(x = cty, y = hwy)) + 
  geom_point() + 
  labs(title = "Plot 3")
(p1 | p2) / p3

-Without the parentheses the plots are still combined, but the layout changes: / has higher precedence than |, so p1 | p2 / p3 is read as p1 | (p2 / p3), placing p1 on the left and stacking p2 over p3 on the right. The parentheses force p1 and p2 to sit side by side first, with p3 spanning the row below.

  2. Using the three plots from the previous exercise, recreate the following patchwork.

Three plots: Plot 1 is a scatterplot of highway mileage versus engine size. Plot 2 is side-by-side box plots of highway mileage versus drive train. Plot 3 is side-by-side box plots of city mileage versus drive train. Plots 1 is on the first row. Plots 2 and 3 are on the next row, each span half the width of Plot 1. Plot 1 is labelled “Fig. A”, Plot 2 is labelled “Fig. B”, and Plot 3 is labelled “Fig. C”.

po1 <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  labs(tag = "Fig. A")

po2 <- ggplot(mpg, aes(x = drv, y = hwy)) + 
  geom_boxplot() + 
  labs(tag = "Fig. B")

po3 <- ggplot(mpg, aes(x = drv, y = cty)) + 
  geom_boxplot() + 
  labs(tag = "Fig. C")

po1 / (po2 | po3)

13 Logical vectors

13.1 Introduction

13.1.1 Prerequisites

library(tidyverse)
library(nycflights13)
# apply logical operations to a variable inside a data frame with mutate()

13.2 Comparison

-We use filter() with the comparison operators <, <=, >, >=, ==, and != to make comparisons.
-For floating point numbers, compare with dplyr::near() rather than ==, because computers store doubles with limited precision.

  1. How does dplyr::near() work? Type near to see the source code. Is sqrt(2)^2 near 2?
near(sqrt(2)^2, 2)
## [1] TRUE
  2. Use mutate(), is.na(), and count() together to describe how the missing values in dep_time, sched_dep_time and dep_delay are connected.
flights |> 
    mutate(dep_time_na = is.na(dep_time),
         sched_dep_time_na = is.na(sched_dep_time),
         dep_delay_na = is.na(dep_delay)) |>
  count(dep_time_na, sched_dep_time_na, dep_delay_na)
## # A tibble: 2 × 4
##   dep_time_na sched_dep_time_na dep_delay_na      n
##   <lgl>       <lgl>             <lgl>         <int>
## 1 FALSE       FALSE             FALSE        328521
## 2 TRUE        FALSE             TRUE           8255

13.3 Boolean algebra

1.Find all flights where arr_delay is missing but dep_delay is not. Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.

#Find all flights where arr_delay is missing but dep_delay is not.
flights |>
  filter(is.na(arr_delay) & !is.na(dep_delay))
# 1,175 rows: these flights departed but have no recorded arrival delay
#Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.
flights |>
  filter(!is.na(arr_time) & !is.na(sched_arr_time) & is.na(arr_delay))
# 717 rows: an arrival time was recorded, but arr_delay could not be computed

2.How many flights have a missing dep_time? What other variables are missing in these rows? What might these rows represent?

#How many flights have a missing dep_time
flights |>
  count(is.na(dep_time))
## # A tibble: 2 × 2
##   `is.na(dep_time)`      n
##   <lgl>              <int>
## 1 FALSE             328521
## 2 TRUE                8255
#What other variables are missing in these rows?
summary(flights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
## 

-These rows might represent cancelled flights.

3.Assuming that a missing dep_time implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?

#definitely cancelled.
cancelled_per_day <-
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_num = sum(cancelled),
    flights_num = n(),
  )
# It is likely that days with more flights would have a higher probability of cancellations

ggplot(cancelled_per_day) +
  geom_point(aes(x = flights_num, y = cancelled_num))

#Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?

flights %>% group_by(month, day) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            prop_cancelled = sum(is.na(dep_time)) / n()) %>%
  ggplot(mapping = aes(x = avg_dep_delay, y = prop_cancelled)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)

13.4 Summaries

13.4.1 Logical summaries

-There are two main logical summaries: any() and all(). any(x) is the equivalent of |; it’ll return TRUE if there are any TRUE’s in x. all(x) is equivalent of &; it’ll return TRUE only if all values of x are TRUE’s.
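
A quick base-R illustration:

```r
x <- c(TRUE, FALSE, TRUE)

any(x)  # TRUE: at least one element is TRUE
all(x)  # FALSE: not every element is TRUE

# Like sum() and mean(), both take na.rm = TRUE
all(c(TRUE, NA))                # NA
all(c(TRUE, NA), na.rm = TRUE)  # TRUE
```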

13.4.4 Exercises

1.What will sum(is.na(x)) tell you? How about mean(is.na(x))?

-sum(is.na(x)) returns the number of NAs in x, because each TRUE counts as 1. mean(is.na(x)) returns the proportion of values in x that are NA.

2.What does prod() return when applied to a logical vector? What logical summary function is it equivalent to? What does min() return when applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.

-When applied to a logical vector, prod() returns the product of all the elements, treating TRUE as 1 and FALSE as 0, so it is 1 only when every value is TRUE; it is equivalent to all().
-When min() is applied to a logical vector, it returns 0 (FALSE) if there are any FALSE values and 1 (TRUE) if all values are TRUE. It is also equivalent to all().

13.5 Conditional transformations

13.5.1 if_else()

-In if_else(), the first argument, condition, is a logical vector, the second, true, gives the output when the condition is true, and the third, false, gives the output if the condition is false. There’s an optional fourth argument, missing which will be used if the input is NA.
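
A minimal sketch of all four arguments:

```r
library(dplyr)

x <- c(-2, 0, 3, NA)
if_else(x > 0, "positive", "non-positive", missing = "unknown")
# "non-positive" "non-positive" "positive" "unknown"
```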

13.5.2 case_when()

-It takes pairs that look like condition ~ output. condition must be a logical vector; when it’s TRUE, output will be used.
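
A minimal sketch (the .default argument requires dplyr 1.1 or later):

```r
library(dplyr)

x <- c(-5, 0, 12, NA)
case_when(
  x < 0  ~ "negative",
  x == 0 ~ "zero",
  x > 0  ~ "positive",
  .default = "missing"  # used when no condition is TRUE
)
# "negative" "zero" "positive" "missing"
```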

13.5.3 Compatible types

-Note that both if_else() and case_when() require compatible types in the output.

Exercises

  1. A number is even if it’s divisible by two, which in R you can find out with x %% 2 == 0. Use this fact and if_else() to determine whether each number between 0 and 20 is even or odd.
numbers <- 0:20
result <- if_else(numbers %% 2 == 0, "Even", "Odd")
result
##  [1] "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd" 
## [11] "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd" 
## [21] "Even"
  2. Given a vector of days like x <- c(“Monday”, “Saturday”, “Wednesday”), use an ifelse() statement to label them as weekends or weekdays.
x <- c("Monday", "Saturday", "Wednesday")
result <- ifelse(x %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
result
## [1] "Weekday" "Weekend" "Weekday"
  3. Use ifelse() to compute the absolute value of a numeric vector called x.
x <- c(-5, 3, -7, 8, -2, 4, 6, -64, 15, -17)
abs_x <- ifelse(x < 0, -x, x)
abs_x
##  [1]  5  3  7  8  2  4  6 64 15 17
  4. Write a case_when() statement that uses the month and day columns from flights to label a selection of important US holidays (e.g., New Years Day, 4th of July, Thanksgiving, and Christmas). First create a logical column that is either TRUE or FALSE, and then create a character column that either gives the name of the holiday or is NA.
flights %>%
  mutate(
    is_holiday = case_when(
      (month == 1 & day == 1) ~ TRUE,          # New Year's Day
      (month == 7 & day == 4) ~ TRUE,          # 4th of July
      (month == 11 & day == 28) ~ TRUE,        # Thanksgiving (Nov 28 in 2013)
      (month == 12 & day == 25) ~ TRUE,        # Christmas
      TRUE ~ FALSE                             # Non-holiday
    ),
    holiday_name = case_when(
      month == 1 & day == 1 ~ "New Year's Day",
      month == 7 & day == 4 ~ "4th of July",
      month == 11 & day == 28 ~ "Thanksgiving",
      month == 12 & day == 25 ~ "Christmas",
      TRUE ~ NA_character_))
## # A tibble: 336,776 × 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, is_holiday <lgl>,
## #   holiday_name <chr>

14 Numbers

14.1 Introduction

-We’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of count(). Then we’ll dive into various numeric transformations that pair well with mutate(), including more general transformations that can be applied to other types of vectors, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with summarize() and show you how they can also be used with mutate().

14.1.1 Prerequisites

library(tidyverse)
library(nycflights13)

14.2 Making numbers

x <- c("1.2", "5.6", "1e3")
parse_double(x)
## [1]    1.2    5.6 1000.0
# Use parse_number() when the string contains non-numeric text that you want to ignore.
x <- c("$1,234", "USD 3,513", "59%")
parse_number(x)
## [1] 1234 3513   59

14.3 Counts

#How count() works
flights |> count(dest)
## # A tibble: 105 × 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     265
##  3 ALB     439
##  4 ANC       8
##  5 ATL   17215
##  6 AUS    2439
##  7 AVL     275
##  8 BDL     443
##  9 BGR     375
## 10 BHM     297
## # ℹ 95 more rows
#If you want to see the most common values, add sort = TRUE
flights |> count(dest, sort = TRUE)
## # A tibble: 105 × 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
## # ℹ 95 more rows
#if you want to see all the values, you can use |> View() or |> print(n = Inf).
#You can perform the same computation “by hand” with group_by(), summarize() and n(). 

flights |> 
  group_by(dest) |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE))
## # A tibble: 105 × 3
##    dest      n delay
##    <chr> <int> <dbl>
##  1 ABQ     254  4.38
##  2 ACK     265  4.85
##  3 ALB     439 14.4 
##  4 ANC       8 -2.5 
##  5 ATL   17215 11.3 
##  6 AUS    2439  6.02
##  7 AVL     275  8.00
##  8 BDL     443  7.05
##  9 BGR     375  8.03
## 10 BHM     297 16.9 
## # ℹ 95 more rows
#n_distinct(x) counts the number of distinct (unique) values of one or more variables. 
flights |> 
  group_by(dest) |> 
  summarize(carriers = n_distinct(carrier)) |> 
  arrange(desc(carriers))
## # A tibble: 105 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ℹ 95 more rows
#A weighted count is a sum. For example you could “count” the number of miles each plane flew:
flights |> 
  group_by(tailnum) |> 
  summarize(miles = sum(distance))
## # A tibble: 4,044 × 2
##    tailnum  miles
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  250866
##  3 N10156  115966
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   25157
##  7 N10575  150194
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ℹ 4,034 more rows
#Weighted counts are a common problem so count() has a wt argument that does the same thing:
flights |> count(tailnum, wt = distance)
## # A tibble: 4,044 × 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  250866
##  3 N10156  115966
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   25157
##  7 N10575  150194
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ℹ 4,034 more rows
#You can count missing values by combining sum() and is.na(). 
flights |> 
  group_by(dest) |> 
  summarize(n_cancelled = sum(is.na(dep_time))) 
## # A tibble: 105 × 2
##    dest  n_cancelled
##    <chr>       <int>
##  1 ABQ             0
##  2 ACK             0
##  3 ALB            20
##  4 ANC             0
##  5 ATL           317
##  6 AUS            21
##  7 AVL            12
##  8 BDL            31
##  9 BGR            15
## 10 BHM            25
## # ℹ 95 more rows

14.3.1 Exercises

1.How can you use count() to count the number of rows with a missing value for a given variable?

flights |> 
  group_by(dest) |> 
  count(is.na(dep_time))
## # A tibble: 203 × 3
## # Groups:   dest [105]
##    dest  `is.na(dep_time)`     n
##    <chr> <lgl>             <int>
##  1 ABQ   FALSE               254
##  2 ACK   FALSE               265
##  3 ALB   FALSE               419
##  4 ALB   TRUE                 20
##  5 ANC   FALSE                 8
##  6 ATL   FALSE             16898
##  7 ATL   TRUE                317
##  8 AUS   FALSE              2418
##  9 AUS   TRUE                 21
## 10 AVL   FALSE               263
## # ℹ 193 more rows

2.Expand the following calls to count() to instead use group_by(), summarize(), and arrange(): flights |> count(dest, sort = TRUE)

flights |> count(tailnum, wt = distance)

flights |>
  group_by(dest) |>
  summarize(n = n()) |>
  arrange(desc(n))
## # A tibble: 105 × 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
## # ℹ 95 more rows
flights |>
  group_by(tailnum) |>
  summarize(total_distance = sum(distance, na.rm = TRUE)) |>
  arrange(desc(total_distance))
## # A tibble: 4,044 × 2
##    tailnum total_distance
##    <chr>            <dbl>
##  1 <NA>           1784167
##  2 N328AA          939101
##  3 N338AA          931183
##  4 N327AA          915665
##  5 N335AA          909696
##  6 N323AA          844529
##  7 N319AA          840510
##  8 N336AA          838086
##  9 N329AA          830776
## 10 N324AA          794895
## # ℹ 4,034 more rows

14.4 Numeric transformations

14.4.1 Arithmetic and recycling rules

  1. Explain in words what each line of the code used to generate Figure 14.1 does.
flights |>  # Start from the flights data set
  group_by(hour = sched_dep_time %/% 100) |>  # Group by a new "hour" variable: integer division by 100 extracts the hour from the HHMM scheduled departure time
  summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |>  # For each hour, compute the proportion of cancelled flights (prop_cancelled, where dep_time is NA) and the total count of flights (n)
  filter(hour > 1) |> # Keep hours after 1 am, dropping the handful of flights scheduled between midnight and 1 am
  ggplot(aes(x = hour, y = prop_cancelled)) + # Plot with "hour" on the x-axis and "prop_cancelled" on the y-axis
  geom_line(color = "grey50") + # Add a line layer; the "color" parameter sets the line color to grey50
  geom_point(aes(size = n)) # Add point markers whose size is mapped to the number of flights n

  2. What trigonometric functions does R provide? Guess some names and look up the documentation. Do they use degrees or radians?

-sin(x) returns the sine of the angle x, where x is in radians.
-cos(x) returns the cosine of the angle x, where x is in radians.
-tan(x) returns the tangent of the angle x, where x is in radians.
-R also provides the inverses asin(), acos(), atan(), and the two-argument atan2(y, x).
-All of these use radians, not degrees; multiply degrees by pi / 180 to convert.
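A quick base-R check that these functions expect radians, not degrees:

```r
# base-R trig works in radians; convert degrees with pi / 180
sin(pi / 2)          # 1
cos(pi)              # -1
sin(90 * pi / 180)   # also 1: 90 degrees converted to radians
atan2(1, 1)          # pi / 4, the angle of the point (1, 1)
```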

  3. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem by running the code below: there’s a gap between each hour.
#Original
flights |> 
  filter(month == 1, day == 1) |> 
  ggplot(aes(x = sched_dep_time, y = dep_delay)) +
  geom_point()

#Modification
flights |>
  filter(month == 1, day == 1) |>
  ggplot(aes(x = sched_dep_time, y = dep_delay)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 2400, by = 100))

Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).

  4. Round dep_time and arr_time to the nearest five minutes.

# Convert dep_time and arr_time to fractional hours
flights <- flights |>
  mutate(
    dep_time = floor(dep_time / 100) + (dep_time %% 100) / 60,
    arr_time = floor(arr_time / 100) + (arr_time %% 100) / 60
  )

# Round dep_time and arr_time to the nearest five minutes

flights <- flights |>
  mutate(
    dep_time = round(dep_time * 12) / 12,  # 5 minutes per unit
    arr_time = round(arr_time * 12) / 12
  )
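A small base-R check of the arithmetic above, using a hypothetical helper name (hhmm_to_hours is not part of the original code):

```r
# HHMM integer -> fractional hours, then snap to the nearest 5 minutes (1/12 hour)
hhmm_to_hours <- function(t) t %/% 100 + (t %% 100) / 60

hhmm_to_hours(517)                   # 5.2833..: 5 hours plus 17/60
round(hhmm_to_hours(517) * 12) / 12  # 5.25, i.e. 5:15, the nearest 5-minute mark
```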

14.5 General transformations

14.5.1 Ranks

#Note that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks:
x <- c(1, 2, 2, 3, 4, NA)
min_rank(desc(x))
## [1]  5  3  3  2  1 NA

-See the documentation for dplyr::row_number(), dplyr::dense_rank(), dplyr::percent_rank(), and dplyr::cume_dist() for variations on ranking.

14.5.2 Offsets

#dplyr::lead() and dplyr::lag() allow you to refer the values just before or just after the “current” value. They return a vector of the same length as the input, padded with NAs at the start or end:

x <- c(2, 5, 11, 11, 19, 35)
lag(x)
## [1] NA  2  5 11 11 19
lead(x)
## [1]  5 11 11 19 35 NA
#x - lag(x) gives you the difference between the current and previous value.
x - lag(x)
## [1] NA  3  6  0  8 16
#x == lag(x) tells you whether the current value is the same as the previous one; FALSE marks the points where the value changes.
x == lag(x)
## [1]    NA FALSE FALSE  TRUE FALSE FALSE

14.5.3 Consecutive identifiers

  • cumsum() of a logical “value changed” vector increments the group id by one each time the value changes.
  • Another approach for creating grouping variables is consecutive_id(), which starts a new group every time one of its arguments changes.
  • To keep the first row from each repeated x, you could use group_by(), consecutive_id(), and slice_head()
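A minimal sketch of both approaches on a toy vector (assumes dplyr >= 1.1, which provides consecutive_id()):

```r
library(dplyr)

df <- tibble(x = c("a", "a", "b", "b", "a"))
df |>
  mutate(
    # cumsum of "value changed" flags increments the id at every change
    id_cumsum = cumsum(x != lag(x, default = first(x))) + 1,
    # consecutive_id() does the same in one step
    id_consec = consecutive_id(x)   # 1 1 2 2 3
  )

# keep the first row of each run of repeated x
df |>
  group_by(id = consecutive_id(x)) |>
  slice_head(n = 1)
```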

Exercise

1.Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

flights |>
  arrange(desc(dep_delay)) |> 
  mutate(rank = min_rank(desc(dep_delay))) |>  
  filter(rank <= 10) 
## # A tibble: 10 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <dbl>          <int>     <dbl>    <dbl>          <int>
##  1  2013     1     9     6.67            900      1301    12.7            1530
##  2  2013     6    15    14.5            1935      1137    16.1            2120
##  3  2013     1    10    11.3            1635      1126    12.7            1810
##  4  2013     9    20    11.7            1845      1014    14.9            2210
##  5  2013     7    22     8.75           1600      1005    10.8            1815
##  6  2013     4    10    11              1900       960    13.7            2211
##  7  2013     3    17    23.3             810       911     1.58           1020
##  8  2013     6    27    10              1900       899    12.6            2226
##  9  2013     7    22    22.9             759       898     1.33           1026
## 10  2013    12     5     7.92           1700       896    11              2020
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, rank <int>

2.Which plane (tailnum) has the worst on-time record?

flights |>
  group_by(tailnum) |>
  summarize(average_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(average_delay)  # ascending; note arrange() has no na.last argument, and NAs always sort last
## # A tibble: 4,044 × 2
##    tailnum average_delay
##    <chr>           <dbl>
##  1 N785SK          -14  
##  2 N710SK          -13  
##  3 N701SK          -11  
##  4 N726SK          -11  
##  5 N859AS          -11  
##  6 N17627          -10.5
##  7 N14628          -10  
##  8 N794SK          -10  
##  9 N583AS           -9.5
## 10 N509AA           -9  
## # ℹ 4,034 more rows
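As printed, the table above sorts ascending, so it actually shows the planes with the best on-time records. A sketch for the worst plane, also filtering out planes with very few flights so a single bad day doesn’t dominate (the n > 20 threshold is an arbitrary choice):

```r
library(dplyr)
library(nycflights13)

flights |>
  group_by(tailnum) |>
  summarize(
    average_delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  ) |>
  filter(n > 20) |>             # drop planes with too few flights to judge
  arrange(desc(average_delay))  # worst on-time record first
```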

3.What time of day should you fly if you want to avoid delays as much as possible?

#Extract the hour with integer division; substring() is unreliable here because
#sched_dep_time is a 3- or 4-digit number, not a fixed-width string
flights <- flights |>
  mutate(hour = sched_dep_time %/% 100)

average_delay_by_hour <- flights |>
  group_by(hour) |>
  summarize(average_delay = mean(dep_delay, na.rm = TRUE))

average_delay_by_hour |>
  arrange(average_delay)

-Average delays are lowest for flights scheduled early in the morning (roughly 5–7 am) and climb steadily through the day, so fly as early as possible.

4.What does flights |> group_by(dest) |> filter(row_number() < 4) do? What does flights |> group_by(dest) |> filter(row_number(dep_delay) < 4) do?

-The first pipeline keeps the first three rows of each destination group, in whatever order the rows currently appear.
-The second ranks rows within each destination by dep_delay (smallest first) and keeps the three least-delayed flights per destination; rows where dep_delay is NA are dropped, because row_number() returns NA for them.

5.For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

#Calculate the total minutes of delay for each destination
destination_dt <- flights |>
  group_by(dest) |>
  summarize(total_delay = sum(dep_delay, na.rm = TRUE))

#Join the original dataset
new_flights <- flights |>
  left_join(destination_dt, by = "dest")

#Calculate the proportion of the total delay for each flight's destination
new_flights |>
  mutate(proportion_of_total_delay = dep_delay / total_delay)
## # A tibble: 336,776 × 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <dbl>          <int>     <dbl>    <dbl>          <int>
##  1  2013     1     1     5.25            515         2     8.5             819
##  2  2013     1     1     5.58            529         4     8.83            830
##  3  2013     1     1     5.67            540         2     9.42            850
##  4  2013     1     1     5.75            545        -1    10.1            1022
##  5  2013     1     1     5.92            600        -6     8.17            837
##  6  2013     1     1     5.92            558        -4     7.67            728
##  7  2013     1     1     5.92            600        -5     9.25            854
##  8  2013     1     1     5.92            600        -3     7.17            723
##  9  2013     1     1     5.92            600        -3     8.67            846
## 10  2013     1     1     6               600        -2     7.92            745
## # ℹ 336,766 more rows
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, total_delay <dbl>,
## #   proportion_of_total_delay <dbl>
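A more compact alternative that avoids the join by computing the total inside a grouped mutate() (a sketch):

```r
library(dplyr)
library(nycflights13)

flights |>
  group_by(dest) |>
  mutate(
    total_delay = sum(dep_delay, na.rm = TRUE),  # per-destination total, recycled to each row
    prop_delay = dep_delay / total_delay
  ) |>
  ungroup() |>
  select(dest, dep_delay, total_delay, prop_delay)
```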

6.Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the average flight delay for an hour is related to the average delay for the previous hour.

#The original set
flights |> 
  mutate(hour = dep_time %/% 100) |> 
  group_by(year, month, day, hour) |> 
  summarize(
    dep_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |> 
  filter(n > 5)
## # A tibble: 578 × 6
##     year month   day  hour dep_delay     n
##    <int> <int> <int> <dbl>     <dbl> <int>
##  1  2013     1     1     0     11.5    838
##  2  2013     1     2     0     13.9    935
##  3  2013     1     2    NA    NaN        8
##  4  2013     1     3     0     11.0    904
##  5  2013     1     3    NA    NaN       10
##  6  2013     1     4     0      8.95   909
##  7  2013     1     4    NA    NaN        6
##  8  2013     1     5     0      5.73   717
##  9  2013     1     6     0      7.15   831
## 10  2013     1     7     0      5.42   930
## # ℹ 568 more rows
#Modified
flights |>
  mutate(hour = dep_time %/% 100) |>
  group_by(year, month, day, hour) |>
  summarize(
    dep_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  filter(n > 5) |>
  mutate(prev_hour_delay = lag(dep_delay)) |>
  na.omit()
## # A tibble: 152 × 7
##     year month   day  hour dep_delay     n prev_hour_delay
##    <int> <int> <int> <dbl>     <dbl> <int>           <dbl>
##  1  2013     1     2     0    13.9     935           11.5 
##  2  2013     1     6     0     7.15    831            5.73
##  3  2013     1     7     0     5.42    930            7.15
##  4  2013     1     8     0     2.55    895            5.42
##  5  2013     1     9     0     2.28    897            2.55
##  6  2013     1    10     0     2.84    929            2.28
##  7  2013     1    11     0     2.82    919            2.84
##  8  2013     1    15     0     0.124   881            2.79
##  9  2013     1    20     0     6.78    782            3.48
## 10  2013     1    21     0     7.83    904            6.78
## # ℹ 142 more rows

-The average delay in an hour tracks the average delay in the previous hour: hours that follow an already-delayed hour tend to be delayed themselves.
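Note that because flights$dep_time was overwritten with fractional hours earlier in this notebook, dep_time %/% 100 collapses every flight into hour 0 above, which is why the hour column is all 0 or NA. A sketch on a fresh copy of the data, also plotting the relationship:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

hourly <- nycflights13::flights |>   # fresh copy, dep_time still in HHMM form
  mutate(hour = dep_time %/% 100) |>
  group_by(year, month, day, hour) |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE), n = n(), .groups = "drop") |>
  filter(n > 5) |>
  mutate(prev_delay = lag(dep_delay))

ggplot(hourly, aes(x = prev_delay, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE)
```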

7.Look at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

#Find the shortest flight to each destination
shortest_flight <- flights |>
  group_by(dest) |>
  mutate(shortest_time = min(air_time, na.rm = TRUE), 
         mean_time = mean(air_time, na.rm = TRUE)) |>
  ungroup() |>
  mutate(diff_from_short = air_time-shortest_time, 
         diff_from_mean = air_time-mean_time) |>
  arrange(diff_from_mean) |>
  select(dest, shortest_time, air_time, diff_from_mean, diff_from_short, tailnum)
#Compute the air time of a flight relative to the shortest flight to that destination.
#flights |>
  #left_join(shortest_flight, by = "dest") |> 
  #mutate(relative_air_time = air_time / shortest_time) |>
  #arrange(desc(relative_air_time)) |>
  #head(10)

-With arrange(diff_from_mean), the suspiciously fast flights appear first; sorting by desc(diff_from_short) instead surfaces the flights most delayed in the air, which in this run were planes N729JB, N531JB, and N566JB.

8.Find all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.

#Find all destinations that are flown by at least two carriers.
destinations_with_twomore_carriers <- flights |>
  group_by(dest) |>
  mutate(carrier_count = n_distinct(carrier)) |>
  filter(carrier_count >= 2) |>
  distinct(dest)

#Lets see the relative ranking of the carriers based on their performance for the same destination.
flights |>
  filter(dest %in% destinations_with_twomore_carriers$dest) |>
  group_by(carrier, dest) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  group_by(carrier) |>
  summarize(relative_rank = mean(avg_dep_delay, na.rm = TRUE)) |>
  arrange(relative_rank)
## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 2
##    carrier relative_rank
##    <chr>           <dbl>
##  1 US               3.85
##  2 HA               4.90
##  3 AS               5.80
##  4 VX               7.01
##  5 DL               7.17
##  6 FL               8.37
##  7 AA               9.90
##  8 MQ              11.2 
##  9 YV              12.0 
## 10 UA              12.5 
## 11 9E              12.5 
## 12 B6              13.0 
## 13 WN              15.6 
## 14 EV              19.4 
## 15 F9              20.2 
## 16 OO              27.6

-By this measure OO has the worst performance across shared destinations and US the best; note this simple average ignores how many flights each carrier makes to each destination.

14.6 Numeric summaries

14.6.1 Center

-mean(), median()

14.6.2 Minimum, maximum, and quantiles

-min() and max() will give you the smallest and largest values.

-quantile() is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values, quantile(x, 0.5) is equivalent to the median, and quantile(x, 0.95) will find the value that’s greater than 95% of the values.

14.6.3 Spread

-IQR() might be new — it’s quantile(x, 0.75) - quantile(x, 0.25) and gives you the range that contains the middle 50% of the data.

14.6.4 Distributions

-geom_freqpoly() can help visualize the distribution of a numeric variable.

14.6.5 Positions

-Extract a value at a specific position with first(x), last(x), and nth(x, n).
-Because dplyr functions use _ to separate components of function and argument names, these functions use na_rm instead of na.rm.

14.6.6 With mutate()

-x / sum(x) calculates the proportion of a total.
-(x - mean(x)) / sd(x) computes a Z-score (standardized to mean 0 and sd 1).
-(x - min(x)) / (max(x) - min(x)) standardizes to range [0, 1].
-x / first(x) computes an index based on the first observation.

14.6.7 Exercises (WARN)

  1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. When is mean() useful? When is median() useful? When might you want to use something else? Should you use arrival delay or departure delay? Why might you want to use data from planes?

-When is mean() useful?: when you want an overall sense of the typical delay; it summarizes the central tendency but is pulled around by extreme delays.

-When is median() useful?: when delays are skewed by a few extreme values; it gives the central value separating the higher half of delays from the lower half.

-When might you want to use something else?: when specific parts of the distribution matter, e.g. quantiles (25th, 75th, 95th percentile) to characterize worst-case delays, or sd()/IQR() to describe variability.

-Arrival delay usually matters more to passengers, while departure delay better reflects airport operations; per-plane data can help separate chronically late aircraft from route or airport effects.

  2. Which destinations show the greatest variation in air speed?
flights |>
  group_by(dest) |>
  summarize(variation = sd(distance / air_time, na.rm = TRUE)) |>  # speed in miles per minute
  arrange(desc(variation)) |>
  head(5)
## # A tibble: 5 × 2
##   dest  variation
##   <chr>     <dbl>
## 1 OKC       0.639
## 2 TUL       0.624
## 3 ILM       0.615
## 4 BNA       0.615
## 5 CLT       0.611

-Flights to OKC show the greatest variation in air speed.

3.Create a plot to further explore the adventures of EGE. Can you find any evidence that the airport moved locations? Can you find another variable that might explain the difference? (Why is this plot empty?)

EGE_flights <- flights |> 
  filter(dest == "EGE")

EGE_flights |> 
  group_by(year) |> 
  summarize(num_flights = n()) |> 
  ggplot(aes(x = year, y = num_flights)) + 
  geom_line() + 
  labs(x = "Year", y = "Number of Flights")
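The plot above is empty because nycflights13 covers a single year (2013): there is only one point per group, and geom_line() has nothing to connect. A sketch that looks at the recorded distance instead (a jump in distance would be evidence the airport moved; origin is another variable that could explain differences):

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

nycflights13::flights |>
  filter(dest == "EGE") |>
  ggplot(aes(x = time_hour, y = distance, color = origin)) +
  geom_point()
```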



15 Strings

15.1.1 Prerequisites

library(tidyverse)
library(babynames)

15.2 Creating a string

-You can create a string using either single quotes (') or double quotes ("). There's no difference in behavior between the two, so in the interests of consistency, the tidyverse style guide recommends using ", unless the string contains multiple ".

15.2.1 Escapes

#To include a literal single or double quote in a string, you can use \ to “escape” it:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
double_quote
## [1] "\""
single_quote
## [1] "'"
#So if you want to include a literal backslash in your string, you’ll need to escape it: "\\":
backslash <- "\\"
backslash
## [1] "\\"
#To see the raw contents of the string, use str_view()
x <- c(single_quote, double_quote, backslash)
x
## [1] "'"  "\"" "\\"
str_view(x)
## [1] │ '
## [2] │ "
## [3] │ \

15.2.2 Raw strings

-A raw string usually starts with r"( and finishes with )". But if your string contains )" you can instead use r"[]" or r"{}", and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., r"--()--", r"---()---", etc.
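A base-R sketch of the idea (raw strings need R >= 4.0):

```r
# the raw string spells out exactly what you see; no escapes needed
x <- r"(backslash \ and "quotes" survive as-is)"
cat(x)

# identical to the escaped spelling of a single backslash:
identical(r"(\)", "\\")  # TRUE
```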

15.2.3 Other special characters

-The most common are \n, a new line, and \t, a tab. You'll also sometimes see strings containing Unicode escapes that start with \u or \U.

15.2.4 Exercises

  1. Create strings that contain the following values:

He said "That's amazing!"

\a\b\c\d

\\\\\\

t1524 <- r"('He said "That's amazing!"'
"\a\b\c\d"
"\\\\\\")" 
t1524
## [1] "'He said \"That's amazing!\"'\n\"\\a\\b\\c\\d\"\n\"\\\\\\\\\\\\\""
str_view(t1524)
## [1] │ 'He said "That's amazing!"'
##     │ "\a\b\c\d"
##     │ "\\\\\\"
  2. Create the string "This\u00a0is\u00a0tricky" in your R session and print it. What happens to the special "\u00a0"? How does str_view() display it? Can you do a little googling to figure out what this special character is?
x <- "This\u00a0is\u00a0tricky"
x
## [1] "This is tricky"
str_view(x)
## [1] │ This{\u00a0}is{\u00a0}tricky
#lets try
x <- c("This", "\u00a0", "is", "\u00a0", "tricky")
x
## [1] "This"   " "      "is"     " "      "tricky"

-"\u00a0" prints like an ordinary space, but str_view() displays it as {\u00a0}: it is the Unicode NO-BREAK SPACE!

15.3 Creating many strings from data

15.3.1 str_c()

-str_c() takes any number of vectors as arguments and returns a character vector

#If you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c():
df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> 
  mutate(
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )
## # A tibble: 4 × 3
##   name  greeting1 greeting2
##   <chr> <chr>     <chr>    
## 1 Flora Hi Flora! Hi Flora!
## 2 David Hi David! Hi David!
## 3 Terra Hi Terra! Hi Terra!
## 4 <NA>  Hi you!   Hi!

15.3.2 str_glue()

  • str_glue() converts missing values to the string “NA”.

15.3.3 str_flatten()

  • str_flatten() takes a character vector and combines each element of the vector into a single string; it works well with summarize()
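A small sketch of str_flatten() inside summarize(), on made-up data:

```r
library(dplyr)
library(stringr)

df <- tribble(
  ~name,    ~fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine"
)

# one row per person, fruits collapsed into a single comma-separated string
df |>
  group_by(name) |>
  summarize(fruits = str_flatten(fruit, ", "))
# Carmen -> "banana, apple"; Marvin -> "nectarine"
```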

15.3.4 Exercises

1.Compare and contrast the results of paste0() with str_c() for the following inputs:

str_c("hi ", NA)
## [1] NA
paste0("hi ", NA)
## [1] "hi NA"
paste0(letters[1:2], letters[1:3])
## [1] "aa" "bb" "ac"

In the first case, paste0() treats NA as the string "NA" and returns "hi NA", while str_c() propagates the missing value and returns NA. In the second case, paste0() recycles the shorter vector (giving "aa" "bb" "ac"), while str_c() refuses to recycle vectors longer than length 1 and throws an error.

  1. What’s the difference between paste() and paste0()? How can you recreate the equivalent of paste() with str_c()?

-See ?paste and ?paste0: paste() joins its arguments with sep = " " by default, while paste0() uses sep = "", so its output has no blank between values. You can recreate paste() with str_c(..., sep = " ").
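A quick comparison (apart from NA handling, str_c() with sep = " " behaves like paste()):

```r
library(stringr)

paste("a", "b", "c")             # "a b c"
paste0("a", "b", "c")            # "abc"
str_c("a", "b", "c", sep = " ")  # "a b c"
str_c("a", "b", "c")             # "abc"
```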

  1. Convert the following expressions from str_c() to str_glue() or vice versa:

str_c("The price of ", food, " is ", price)

str_glue("I'm {age} years old and live in {country}")

str_c("\\section{", title, "}")

-food <- "food"; price <- "price"; age <- "age"; country <- "country"; title <- "title"

-str_glue("The price of {food} is {price}")

-str_c("I'm ", age, " years old and live in ", country)

-str_glue("\\section{{{title}}}")

15.4 Extracting data from strings

15.4.1 Separating into rows

-The most common case is splitting on a delimiter with separate_longer_delim().

-separate_longer_position() is suitable for dataset with a very compact format where each character is used to record a value

15.4.2 Separating into columns

-separate_wider_delim() can separate a string into columns, but it needs the delimiter and the names in the arguments.

-In the argument, you can use an NA name to omit it from results.

-separate_wider_position() works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them
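A small sketch of both functions on made-up data (the column names and widths here are illustrative):

```r
library(tidyr)
library(tibble)

# delimiter-based: name each resulting column
tibble(x = c("a-1", "b-2", "c-3")) |>
  separate_wider_delim(x, delim = "-", names = c("code", "n"))

# position-based: a named integer vector of column widths
tibble(x = c("202215TX", "202122LA")) |>
  separate_wider_position(x, widths = c(year = 4, age = 2, state = 2))
```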

15.4.3 Diagnosing widening problems

-separate_wider_delim() provides two arguments to help if some of the rows don’t have the expected number of pieces: too_few and too_many.

  • too_few = "debug" adds extra columns that help you locate the problem rows; too_few = "align_start" and too_few = "align_end" fill in the missing pieces with NAs and move on.

15.5 Letters

15.5.1 Length

-str_length() tells you the number of letters in the string

15.5.2 Subsetting

-You can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end.
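A quick stringr check (negative positions count back from the end of the string):

```r
library(stringr)

x <- "apple"
str_sub(x, 1, 3)    # "app"
str_sub(x, -3, -1)  # "ple"
str_sub(x, 2, -2)   # "ppl": mixing positive and negative positions works too
```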

Exercises (WARN)

  1. We could use str_sub() with mutate() to find the first and last letter of each name (remember that negative positions count from the end of the string).
babynames <- babynames::babynames
babynames |>
  mutate(
    first_letter = str_sub(name, 1, 1),
    last_letter = str_sub(name, -1, -1)
  )
## # A tibble: 1,924,665 × 7
##     year sex   name          n   prop first_letter last_letter
##    <dbl> <chr> <chr>     <int>  <dbl> <chr>        <chr>      
##  1  1880 F     Mary       7065 0.0724 M            y          
##  2  1880 F     Anna       2604 0.0267 A            a          
##  3  1880 F     Emma       2003 0.0205 E            a          
##  4  1880 F     Elizabeth  1939 0.0199 E            h          
##  5  1880 F     Minnie     1746 0.0179 M            e          
##  6  1880 F     Margaret   1578 0.0162 M            t          
##  7  1880 F     Ida        1472 0.0151 I            a          
##  8  1880 F     Alice      1414 0.0145 A            e          
##  9  1880 F     Bertha     1320 0.0135 B            a          
## 10  1880 F     Sarah      1288 0.0132 S            h          
## # ℹ 1,924,655 more rows
  2. When computing the distribution of the length of babynames, why did we use wt = n? Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

-We use wt = n because each row of babynames is a name-year-sex combination observed n times; weighting by n counts babies rather than distinct name rows.

babynames |>
  mutate(
    middle_letter = ifelse(str_length(name) %% 2 == 1, 
                            str_sub(name, str_length(name) %/% 2 + 1,
                                    str_length(name) %/% 2 + 1),
                           str_sub(name, str_length(name) %/% 2,
                                   str_length(name) %/% 2 + 1) ))
## # A tibble: 1,924,665 × 6
##     year sex   name          n   prop middle_letter
##    <dbl> <chr> <chr>     <int>  <dbl> <chr>        
##  1  1880 F     Mary       7065 0.0724 ar           
##  2  1880 F     Anna       2604 0.0267 nn           
##  3  1880 F     Emma       2003 0.0205 mm           
##  4  1880 F     Elizabeth  1939 0.0199 a            
##  5  1880 F     Minnie     1746 0.0179 nn           
##  6  1880 F     Margaret   1578 0.0162 ga           
##  7  1880 F     Ida        1472 0.0151 d            
##  8  1880 F     Alice      1414 0.0145 i            
##  9  1880 F     Bertha     1320 0.0135 rt           
## 10  1880 F     Sarah      1288 0.0132 r            
## # ℹ 1,924,655 more rows
  3. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
#Part 1
library(babynames)
babynames1 <- babynames |>
  group_by(year) |>
  summarize(average_name_length = weighted.mean(nchar(name), n))  # weight by n so popular names count more

ggplot(data = babynames1, aes(x = year, y = average_name_length)) +
  geom_line() +
  labs(x = "Year", y = "Average Name Length") +
  ggtitle("Trends in the Length of Baby Names Over Time")

#Part 2
babynames2 <- babynames |>
  mutate(first_letter = str_sub(name, 1, 1),
         last_letter = str_sub(name, -1, -1))

# count(..., wt = n) sums babies per letter; the original grouped mutate() plus
# geom_bar(stat = "identity") would have stacked the per-row counts many times over
first_counts <- babynames2 |> count(first_letter, wt = n)
last_counts <- babynames2 |> count(last_letter, wt = n)

ggplot(data = first_counts, aes(x = first_letter, y = n)) +
  geom_col() +
  labs(x = "First Letter", y = "Count") +
  ggtitle("Popularity of First Letters")

ggplot(data = last_counts, aes(x = last_letter, y = n)) +
  geom_col() +
  labs(x = "Last Letter", y = "Count") +
  ggtitle("Popularity of Last Letters")

15.6 Non-English text

15.6.1 Encoding

-When reading data, specify the file's character encoding, e.g. read_csv(file, locale = locale(encoding = "Latin1")).

15.6.2 Letter variations

-Working in languages with accents poses a significant challenge when determining the position of letters (e.g., with str_length() and str_sub())

-Note that a comparison of these strings with == interprets these strings as different, while the handy str_equal() function in stringr recognizes that both have the same appearance

-The locale argument (e.g. str_to_upper(x, locale = "tr")) adapts string operations to a specific language's rules.

16 Regular expressions

16.1 Introduction

library(tidyverse)
library(babynames)

16.3 Key functions

16.3.1 Detect matches

-str_detect() returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE otherwise

16.3.2 Count matches

-str_count() tells you how many matches there are in each string.

  • str_to_lower() convert words to lower case.

16.3.3 Replace values

-str_replace() replaces the first match, and as the name suggests, str_replace_all() replaces all matches

-str_remove() and str_remove_all() are handy shortcuts for str_replace(x, pattern, "") and str_replace_all(x, pattern, "")

16.3.4 Extract variables

-To extract this data using separate_wider_regex() we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:

df |> separate_wider_regex(
  str,
  patterns = c(
    "<", name = "[A-Za-z]+", ">-",
    gender = ".", "_",
    age = "[0-9]+"
  )
)

16.3 Exercises (warn)

1.What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

#Baby name with the most vowels
babynames |>
  mutate(vowel_count = str_count(name, "[aeiouAEIOU]")) |>
  filter(vowel_count == max(vowel_count)) |>
  distinct(name)
## # A tibble: 2 × 1
##   name           
##   <chr>          
## 1 Mariaguadalupe 
## 2 Mariadelrosario
#Baby name with the highest proportion of vowels
babynames |>
  mutate(vowel_count = str_count(name, "[aeiouAEIOU]"))|>
  mutate(vowel_proportion = vowel_count / nchar(name)) |>
  filter(vowel_proportion == max(vowel_proportion)) |>
  select(name, vowel_proportion)
## # A tibble: 110 × 2
##    name  vowel_proportion
##    <chr>            <dbl>
##  1 Eua                  1
##  2 Eua                  1
##  3 Eua                  1
##  4 Eua                  1
##  5 Ea                   1
##  6 Ai                   1
##  7 Ai                   1
##  8 Ai                   1
##  9 Ia                   1
## 10 Ai                   1
## # ℹ 100 more rows

2.Replace all forward slashes in “a/b/c/d/e” with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We’ll discuss the problem very soon.)

original <- "a/b/c/d/e"
replaced <- str_replace_all(original, "/", "\\\\")  # "\\\\" in the replacement is one literal backslash
str_view(replaced)
## [1] │ a\b\c\d\e
undone <- str_replace_all(replaced, "\\\\", "/")    # four backslashes in the pattern match one literal backslash
undone
## [1] "a/b/c/d/e"

-The round trip restores the original string here; the catch is that every literal backslash needs four backslashes in the pattern string. (The earlier gsub("/", "\\", ...) attempt silently dropped the lone trailing backslash in the replacement, which is why the slashes simply disappeared.)

3.Implement a simple version of str_to_lower() using str_replace_all().

replacements <- c(
  "A" = "a", "B" = "b", "C" = "c", "D" = "d", "E" = "e",
  "F" = "f", "G" = "g", "H" = "h", "I" = "i", "J" = "j",
  "K" = "k", "L" = "l", "M" = "m", "N" = "n", "O" = "o",
  "P" = "p", "Q" = "q", "R" = "r", "S" = "s", "T" = "t",
  "U" = "u", "V" = "v", "W" = "w", "X" = "x", "Y" = "y",
  "Z" = "z"
)
lower_words <- str_replace_all(words, pattern = replacements)
head(lower_words)
## [1] "a"        "able"     "about"    "absolute" "accept"   "account"

4.Create a regular expression that will match telephone numbers as commonly written in your country.

x <- c("135-6247-5567")  # the sample number now contains the dashes the pattern expects
str_view(x, "\\d{3}-\\d{4}-\\d{4}")

16.4 Pattern details

16.4.1 Escaping

-We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.".

16.4.2 Anchors

-Anchor the regular expression using ^ to match the start or $ to match the end

-To force a regular expression to match only the full string, anchor it with both ^ and $

-Match the boundary between words (i.e. the start or end of a word) with \b
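A short stringr sketch of the anchors on toy input:

```r
library(stringr)

x <- c("apple pie", "apple", "pineapple")
str_detect(x, "^apple")   # TRUE TRUE FALSE  (starts with "apple")
str_detect(x, "apple$")   # FALSE TRUE TRUE  (ends with "apple")
str_detect(x, "^apple$")  # FALSE TRUE FALSE (is exactly "apple")
```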

16.4.3 Character classes

-There are several shorthand character-class pairs worth memorizing: \d (digits) and \D (non-digits), \s (whitespace) and \S (non-whitespace), \w (word characters) and \W (non-word characters).

16.4.4 Quantifiers

-{n} matches exactly n times.

-{n,} matches at least n times.

-{n,m} matches between n and m times.

16.4.5 Operator precedence and parentheses

-quantifiers have high precedence and alternation has low precedence, which means that ab+ is equivalent to a(b+), and ^a|b$ is equivalent to (^a)|(b$).

16.4.6 Grouping and capturing

-\1 refers to the match contained in the first parenthesis, \2 in the second parenthesis, and so on.

16.4.7 Exercises

1.How would you match the literal string "'\? How about "$^$"?

To match "'\, escape the quote for the string and double the backslash at both levels: str_view("\"'\\", "\"'\\\\").
To match "$^$", escape the regex metacharacters: str_view("\"$^$\"", "\"\\$\\^\\$\"").

2.Explain why each of these patterns don't match a \: "\", "\\", "\\\".

A backslash is special both in strings and in regular expressions, so it must be escaped at both levels; matching one literal backslash requires the string "\\\\".

For "\": the backslash starts a string escape, so it escapes the closing quote and the string is never terminated.

For "\\": the string contains the single regex character \, which in the regex escapes the next character, but nothing follows, so the regex is incomplete.

For "\\\": the first two backslashes become one literal backslash in the string, and the third again escapes the closing quote, leaving the string unterminated.

3.Given the corpus of common words in stringr::words, create regular expressions that find all words that:

a.Start with "y". ("^y")

b.Don't start with "y". ("^[^y]")

c.End with "x". ("x$")

d.Are exactly three letters long. (Don't cheat by using str_length()!) ("^[a-z]{3}$")

e.Have seven letters or more. ("[a-z]{7,}$")

f.Contain a vowel-consonant pair. ("[aeiou][^aeiou]")

g.Contain at least two vowel-consonant pairs in a row. ("[aeiou][^aeiou][aeiou][^aeiou]")

h.Only consist of repeated vowel-consonant pairs. ("^(?:[aeiou][^aeiou])+$")

4.Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!

a(ero)?plane

alumin(ium|um)

analog(ue)?

ar?se

cent(re|er)

defen(s|c)e

d(ough)?nut

gr(a|e)y

modell?ing

s[ck]eptic

summar(ize|ise)

5.Switch the first and last letters in words. Which of those strings are still words?

switched <- str_replace(words, "^(.)(.*)(.)$", "\\3\\2\\1")
words[words %in% switched]
##  [1] "a"          "america"    "area"       "dad"        "dead"      
##  [6] "deal"       "dear"       "depend"     "dog"        "educate"   
## [11] "else"       "encourage"  "engine"     "europe"     "evidence"  
## [16] "example"    "excuse"     "exercise"   "expense"    "experience"
## [21] "eye"        "god"        "health"     "high"       "knock"     
## [26] "lead"       "level"      "local"      "nation"     "no"        
## [31] "non"        "on"         "rather"     "read"       "refer"     
## [36] "remember"   "serious"    "stairs"     "test"       "tonight"   
## [41] "transport"  "treat"      "trust"      "window"     "yesterday"

6.Describe in words what these regular expressions match. (Read carefully to see whether each entry is a regular expression or a string that defines a regular expression.)

  1. ^.*$ This expression matches any complete string or line, including an empty one: ^ anchors the start, .* matches zero or more characters, and $ anchors the end.

  2. "\\{.+\\}" This string defines the regex \{.+\}, which matches a literal { and } with one or more characters in between, e.g. "{abc}".

  3. \d{4}-\d{2}-\d{2} This regular expression matches a date in the form "YYYY-MM-DD", where \d represents a digit (0-9).

  4. "\\\\{4}" This string defines the regex \\{4}, which matches exactly four literal backslashes in a row.

  5. \..\..\.. This regular expression matches a literal period followed by any character, three times in a row (e.g. ".a.b.c").

  6. (.)\1\1 This regular expression matches the same character repeated three times in a row: (.) captures any character, and each \1 matches that captured character again.

  7. "(..)\1" This string defines the regex (..)\1, which matches a pair of characters repeated (e.g. "abab"): (..) captures any two characters, and \1 matches those same two characters again.

7.Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

16.5 Pattern control

16.5.1 Regex flags

  • The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms

  • dotall = TRUE lets . match everything, including \n

  • multiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the complete string

-comments = TRUE tweaks the pattern language to ignore spaces and new lines, as well as everything after #. This allows you to use comments and whitespace to make complex regular expressions more understandable
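
A small sketch of these flags in action (assuming stringr is loaded):

```r
library(stringr)

# ignore_case: match either case
str_detect("BANANA", regex("banana", ignore_case = TRUE))   # TRUE

# multiline: ^ and $ anchor each line, not just the whole string
str_extract_all("Line 1\nLine 2", regex("^Line .$", multiline = TRUE))[[1]]
# "Line 1" "Line 2"

# comments: whitespace and # comments inside the pattern are ignored
phone <- regex("
  \\d{3}   # area code
  -        # separator
  \\d{4}   # number
", comments = TRUE)
str_detect("555-0123", phone)   # TRUE
```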

16.5.2 Fixed matches

-You can opt-out of the regular expression rules by using fixed()

-fixed() also gives you the ability to ignore case

-If you’re working with non-English text, you will probably want coll() instead of fixed(), as it implements the full rules for capitalization as used by the locale you specify.
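
For example (a sketch; the coll() result can depend on your locale data):

```r
library(stringr)

# "." is a metacharacter unless the pattern is wrapped in fixed()
str_detect(c("a.b", "axb"), "a.b")          # TRUE TRUE
str_detect(c("a.b", "axb"), fixed("a.b"))   # TRUE FALSE

# fixed() can still ignore case
str_detect("A.B", fixed("a.b", ignore_case = TRUE))   # TRUE

# coll() uses the locale's collation rules, e.g. Turkish dotted/dotless i
str_detect("\u0130", coll("i", ignore_case = TRUE, locale = "tr"))
```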

16.6 Practice

16.6.1 Check your work

  • Construct your pattern step by step, checking your matches against both positive and negative examples as you go.

  • Pay attention to the details: make sure the pattern doesn't also match other, unintended strings.

16.6.2 Boolean operations

  • Imagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels ([^aeiou]), then allow that to match any number of letters ([^aeiou]+), then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$)

  • But you can make this problem a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels

-If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.

16.6.3 Creating a pattern with code

-create the pattern from the vector using str_c() and str_flatten()

-whenever you create patterns from existing strings it’s wise to run them through str_escape() to ensure they match literally
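
A short sketch of building a pattern from a vector (str_escape() needs stringr 1.5.0 or later):

```r
library(stringr)

vowels <- c("a", "e", "i", "o", "u")
# Build a character class from a vector
pattern <- str_c("[", str_flatten(vowels), "]")
str_detect(c("dry", "tree"), pattern)   # FALSE TRUE

# Escape metacharacters before turning existing strings into a pattern
needles <- c("a.c", "x+y")
safe <- str_c("(", str_flatten(str_escape(needles), "|"), ")")
str_detect("a.c", safe)   # TRUE
str_detect("abc", safe)   # FALSE, because the "." now matches literally
```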

16.6.4 Exercises (Q)

1.For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

a.Find all words that start or end with x.

-Single regex: str_subset(words, "^x|x$")

-Multiple calls: words[str_detect(words, "^x") | str_detect(words, "x$")]

b.Find all words that start with a vowel and end with a consonant.

-Single regex: str_subset(words, "^[aeiou].*[^aeiou]$")

-Multiple calls: words[str_detect(words, "^[aeiou]") & str_detect(words, "[^aeiou]$")]

c.Are there any words that contain at least one of each different vowel?

-Single regex (lookaheads): str_detect(words, "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")

-Multiple calls: str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u") — no word in words contains all five vowels.

2.Construct patterns to find evidence for and against the rule "i before e except after c".

-Evidence for the rule: str_detect(words, "[^c]ie|cei")

-Evidence against the rule: str_detect(words, "[^c]ei|cie")

3.colors() contains a number of modifiers like "lightgray" and "darkblue". How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified.)

4.Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = “datasets”)$results[, “Item”]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.

base_datasets <- data(package = "datasets")$results[, "Item"]

# Strip the "(grouping data frame)" suffix from the old vector datasets
dataset_names <- str_replace(base_datasets, "\\s*\\(.*\\)$", "")

# Build one alternation pattern that matches any base R dataset name
pattern <- str_c("\\b(", str_flatten(str_escape(dataset_names), "|"), ")\\b")

16.7 Regular expressions in other places

16.7.1 tidyverse

-There are three other particularly useful places where you might want to use regular expressions

-matches(pattern) will select all variables whose name matches the supplied pattern.

-pivot_longer()’s names_pattern argument takes a vector of regular expressions, just like separate_wider_regex(). It’s useful when extracting data out of variable names with a complex structure

-The delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern.

16.7.2 Base R

-apropos(pattern) searches all objects available from the global environment that match the given pattern.

17 Factors

17.2 Factor basics

-Create a list of the valid levels, and then create a factor following these valid levels.

-If you omit the levels, they’ll be taken from the data in alphabetical order

-Sorting alphabetically is slightly risky because not every computer will sort strings in the same way. So forcats::fct() orders by first appearance

-If you ever need to access the set of valid levels directly, you can do so with levels()
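
These points in a small sketch:

```r
library(forcats)

x <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Explicit levels control the sort order
f1 <- factor(x, levels = month_levels)
as.character(sort(f1))    # "Jan" "Mar" "Apr" "Dec"

# Omitting levels falls back to alphabetical order
levels(factor(x))         # "Apr" "Dec" "Jan" "Mar"

# fct() orders by first appearance instead
levels(fct(x))            # "Dec" "Apr" "Jan" "Mar"
```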

17.3 General Social Survey

1.Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

ggplot(gss_cat, aes(rincome)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

The default bar chart is hard to understand because the long x-axis labels overlap and become unreadable.

#Improve it by flipping the coordinates so the labels are readable
ggplot(gss_cat, aes(rincome)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  coord_flip()

2.What is the most common relig in this survey? What’s the most common partyid?

#Most common relig
gss_cat %>%
  count(relig) %>%
  arrange(-n) %>%
  head(3)
## # A tibble: 3 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Catholic    5124
## 3 None        3523
#Most common partyid
gss_cat %>%
  count(partyid) %>%
  arrange(-n) %>%
  head(3)
## # A tibble: 3 × 2
##   partyid              n
##   <fct>            <int>
## 1 Independent       4119
## 2 Not str democrat  3690
## 3 Strong democrat   3490

3.Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?

#Which relig does denom (denomination) apply to
levels(gss_cat$denom)
##  [1] "No answer"            "Don't know"           "No denomination"     
##  [4] "Other"                "Episcopal"            "Presbyterian-dk wh"  
##  [7] "Presbyterian, merged" "Other presbyterian"   "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which"    "Evangelical luth"    
## [13] "Other lutheran"       "Wi evan luth synod"   "Lutheran-mo synod"   
## [16] "Luth ch in america"   "Am lutheran"          "Methodist-dk which"  
## [19] "Other methodist"      "United methodist"     "Afr meth ep zion"    
## [22] "Afr meth episcopal"   "Baptist-dk which"     "Other baptists"      
## [25] "Southern baptist"     "Nat bapt conv usa"    "Nat bapt conv of am" 
## [28] "Am bapt ch in usa"    "Am baptist asso"      "Not applicable"
#How can you find out with a table
gss_cat %>%
  filter(!denom %in% c("No answer", "Other", "Don't know", "Not applicable",   "No denomination")) %>%
  count(relig)
## # A tibble: 1 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant  7025
#How can you find out with a visualization
gss_cat %>%
  count(relig, denom) %>%
  ggplot(aes(x = relig, y = denom, size = n)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))

17.4 Modifying factor order

  1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?
gss_cat %>%
  filter(!is.na(tvhours)) %>%
  ggplot(aes(x = tvhours)) +
  geom_histogram(binwidth = 1)

There are some outliers in tvhours, so the median is a better summary than the mean.

  2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
levels(gss_cat$marital)
## [1] "No answer"     "Never married" "Separated"     "Divorced"     
## [5] "Widowed"       "Married"
#marital is arbitrary

levels(gss_cat$race)
## [1] "Other"          "Black"          "White"          "Not applicable"
#race is arbitrary

levels(gss_cat$rincome)
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
#rincome is principled

levels(gss_cat$partyid)
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"
#partyid is arbitrary

levels(gss_cat$relig)
##  [1] "No answer"               "Don't know"             
##  [3] "Inter-nondenominational" "Native american"        
##  [5] "Christian"               "Orthodox-christian"     
##  [7] "Moslem/islam"            "Other eastern"          
##  [9] "Hinduism"                "Buddhism"               
## [11] "Other"                   "None"                   
## [13] "Jewish"                  "Catholic"               
## [15] "Protestant"              "Not applicable"
#relig is arbitrary

levels(gss_cat$denom)
##  [1] "No answer"            "Don't know"           "No denomination"     
##  [4] "Other"                "Episcopal"            "Presbyterian-dk wh"  
##  [7] "Presbyterian, merged" "Other presbyterian"   "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which"    "Evangelical luth"    
## [13] "Other lutheran"       "Wi evan luth synod"   "Lutheran-mo synod"   
## [16] "Luth ch in america"   "Am lutheran"          "Methodist-dk which"  
## [19] "Other methodist"      "United methodist"     "Afr meth ep zion"    
## [22] "Afr meth episcopal"   "Baptist-dk which"     "Other baptists"      
## [25] "Southern baptist"     "Nat bapt conv usa"    "Nat bapt conv of am" 
## [28] "Am bapt ch in usa"    "Am baptist asso"      "Not applicable"
#denom is arbitrary
  3. Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?

-Because the plotting order follows the factor-level order, and the first level is drawn at the bottom of the y-axis. Using fct_relevel() to move “Not applicable” to the front of the levels therefore moves it to the bottom of the plot.

17.5 Modifying factor levels

-fct_recode() will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.

-If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode()

-Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the fct_lump_*() family of functions. fct_lump_lowfreq() is a simple starting point that progressively lumps the smallest groups categories into “Other”, always keeping “Other” as the smallest category.
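
A compact sketch of the three helpers:

```r
library(forcats)

x <- factor(c("Strong republican", "Not str republican", "Independent"))

# fct_recode(): rename one level, leaving the rest as is
levels(fct_recode(x, "Republican, strong" = "Strong republican"))

# fct_collapse(): map many old levels to one new level
levels(fct_collapse(x, rep = c("Strong republican", "Not str republican")))

# fct_lump_n(): keep the n most frequent levels, lump the rest into "Other"
y <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))
table(fct_lump_n(y, n = 2))   # a: 5, b: 3, Other: 2
```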

  1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) |>
  group_by(year, partyid) |>
  summarize(n = n()) |>
  ggplot(mapping = aes(x = year, y = n, color = fct_reorder2(partyid, year, n))) +
  geom_point() +
  geom_line() +
  labs(color = 'Party',
       x = 'Year',
       y = 'Count')
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

The general trends of changing identity are similar within three groups, but Independent has the largest volume of changes.

  2. How could you collapse rincome into a small set of categories?
gss_cat |>
  mutate(rincome = fct_collapse(rincome,
    "No answer" = c("No answer", "Don't know", "Refused"),
    "Lt $5000" = c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "$4000 to 4999"),
    "$5000 to 9999" = c("$5000 to 5999", "$6000 to 6999",
                        "$7000 to 7999", "$8000 to 9999"))) |>
  count(rincome)
## # A tibble: 8 × 2
##   rincome            n
##   <fct>          <int>
## 1 No answer       1425
## 2 $25000 or more  7363
## 3 $20000 - 24999  1283
## 4 $15000 - 19999  1048
## 5 $10000 - 14999  1168
## 6 $5000 to 9999    970
## 7 Lt $5000        1183
## 8 Not applicable  7043
  3. Notice there are 9 groups (excluding other) in the fct_lump example above. Why not 10? (Hint: type ?fct_lump, and find the default for the argument other_level is “Other”.)

The fct_lump function combined the least frequent levels into a single lumped category. Since the default other_level is “Other”, those lumped levels appear as one “Other” level, so we end up with 9 named groups plus “Other” rather than 10 distinct levels.

17.6 Ordered factors

-Ordered factors, created with ordered(), imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on.

18 Dates and times

library(tidyverse)
library(nycflights13)

18.2 Creating date/times

18.2.1 During import

-If your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it

-For other date-time formats, you’ll need to use col_types plus col_date() or col_datetime() along with a date-time format.

-If you’re using %b or %B and working with non-English dates, you’ll also need to provide a locale(). See the list of built-in languages in date_names_langs(), or create your own with date_names(),

18.2.2 From strings

  • Identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date.

  • ymd() and friends create dates.

18.2.3 From individual components

  • When the date-time components are spread across multiple columns, use make_date() for dates, or make_datetime() for date-times

18.2.4 From other types

-You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():

-Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().
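
For example:

```r
library(lubridate)

# Date-time <-> date
as_date(ymd_hms("2023-05-01 12:30:00"))   # "2023-05-01"
as_datetime(ymd("2023-05-01"))            # "2023-05-01 UTC"

# Numeric offsets from the Unix Epoch
as_datetime(3600)   # "1970-01-01 01:00:00 UTC" (offset in seconds)
as_date(365)        # "1971-01-01"              (offset in days)
```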

18.2.5 Exercises

  1. What happens if you parse a string that contains invalid dates?
ymd(c("2010-10-10", "bananas"))
## Warning: 1 failed to parse.
## [1] "2010-10-10" NA

-It will report a failed to parse.

  2. What does the tzone argument to today() do? Why is it important?

-It is a character vector specifying which time zone you would like the current time in. It is important since different time-zones can have different dates, and tzone can help us specify the time.

  3. For each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.

d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"

library(lubridate)

d1 <- "January 1, 2010"
parse_date(d1, format = "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, format = "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, format = "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, format = "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14"
parsed_date5 <- parse_date(d5, format = "%m/%d/%y")

t1 <- "1705"
parsed_time1 <- hms(t1)

t2 <- "11:15:10.12 PM"
parsed_time2 <- hms(paste(t2, "12"))

18.3 Date-time components

18.3.1 Getting components

-You can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second(). These are effectively the opposites of make_datetime().

-For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.

-We can use wday() to see that more flights depart during the week than on the weekend

18.3.2 Rounding

-An alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to.
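
For example:

```r
library(lubridate)

x <- ymd_hms("2023-06-15 13:45:40")
floor_date(x, "hour")     # "2023-06-15 13:00:00 UTC"
round_date(x, "hour")     # "2023-06-15 14:00:00 UTC"
ceiling_date(x, "day")    # "2023-06-16 UTC"
```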

18.3.3 Modifying components

-Alternatively, rather than modifying an existing variable, you can create a new date-time with update()

18.3.4 Exercises (Q)

1.How does the distribution of flight times within a day change over the course of the year?

#Preparation
flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 × 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ℹ 336,766 more rows
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}


flights_dt <- flights |> 
  filter(!is.na(dep_time), !is.na(arr_time)) |> 
  mutate(
    dep_time = as.integer(dep_time),
    arr_time = as.integer(arr_time),
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |> 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt
## # A tibble: 328,063 × 9
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 EWR    IAH           2        11 2013-01-01 00:05:00 2013-01-01 05:15:00
##  2 LGA    IAH           4        20 2013-01-01 00:05:00 2013-01-01 05:29:00
##  3 JFK    MIA           2        33 2013-01-01 00:05:00 2013-01-01 05:40:00
##  4 JFK    BQN          -1       -18 2013-01-01 00:05:00 2013-01-01 05:45:00
##  5 LGA    ATL          -6       -25 2013-01-01 00:05:00 2013-01-01 06:00:00
##  6 EWR    ORD          -4        12 2013-01-01 00:05:00 2013-01-01 05:58:00
##  7 EWR    FLL          -5        19 2013-01-01 00:05:00 2013-01-01 06:00:00
##  8 LGA    IAD          -3       -14 2013-01-01 00:05:00 2013-01-01 06:00:00
##  9 JFK    MCO          -3        -8 2013-01-01 00:05:00 2013-01-01 06:00:00
## 10 LGA    ORD          -2         8 2013-01-01 00:06:00 2013-01-01 06:00:00
## # ℹ 328,053 more rows
## # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
#Plot
flights_dt |>
  filter(!is.na(dep_time)) |>
  mutate(dep_hour = update(dep_time, yday = 1)) |>
  mutate(month = factor(month(dep_time))) |>
  ggplot(aes(x = dep_hour, group = month, color = month)) +
  geom_freqpoly(binwidth = 60 * 60) # one-hour bins, in seconds

2.Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.

flights_dt |> 
  select(contains('dep')) |>
  mutate(cal_delay = as.numeric(dep_time - sched_dep_time) / 60) |>
  filter(dep_delay != cal_delay)
## # A tibble: 328,063 × 4
##    dep_delay dep_time            sched_dep_time      cal_delay
##        <dbl> <dttm>              <dttm>                  <dbl>
##  1         2 2013-01-01 00:05:00 2013-01-01 05:15:00   -0.0861
##  2         4 2013-01-01 00:05:00 2013-01-01 05:29:00   -0.09  
##  3         2 2013-01-01 00:05:00 2013-01-01 05:40:00   -0.0931
##  4        -1 2013-01-01 00:05:00 2013-01-01 05:45:00   -0.0944
##  5        -6 2013-01-01 00:05:00 2013-01-01 06:00:00   -0.0986
##  6        -4 2013-01-01 00:05:00 2013-01-01 05:58:00   -0.0981
##  7        -5 2013-01-01 00:05:00 2013-01-01 06:00:00   -0.0986
##  8        -3 2013-01-01 00:05:00 2013-01-01 06:00:00   -0.0986
##  9        -3 2013-01-01 00:05:00 2013-01-01 06:00:00   -0.0986
## 10        -2 2013-01-01 00:06:00 2013-01-01 06:00:00   -0.0983
## # ℹ 328,053 more rows

-They are not always consistent. dep_time is built from the scheduled date, so flights that actually left after midnight get a dep_time on the wrong day, and for those flights dep_time - sched_dep_time disagrees with dep_delay. The difference also has to be converted to minutes before comparing it with dep_delay.

3.Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.) (Why duration is ZERO?)

flights_dt |>
  mutate(
    flight_duration = as.numeric(arr_time - dep_time),
    air_time_mins = air_time,
    diff = flight_duration - air_time_mins
  ) |>
  select(origin, dest, flight_duration, air_time_mins, diff)
## # A tibble: 328,063 × 5
##    origin dest  flight_duration air_time_mins  diff
##    <chr>  <chr>           <dbl>         <dbl> <dbl>
##  1 EWR    IAH               180           227   -47
##  2 LGA    IAH               180           227   -47
##  3 JFK    MIA               240           160    80
##  4 JFK    BQN               300           183   117
##  5 LGA    ATL               180           116    64
##  6 EWR    ORD               120           150   -30
##  7 EWR    FLL               240           158    82
##  8 LGA    IAD               120            53    67
##  9 JFK    MCO               180           140    40
## 10 LGA    ORD                60           138   -78
## # ℹ 328,053 more rows

-The two rarely agree: arr_time and dep_time are local clock times, so the difference is shifted by the time-zone offset between the origin and destination airports, and it also includes time spent taxiing on the ground, while air_time counts only time in the air.

4.How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?

#dep_time
flights_dt |>
  mutate(dep_hour = hour(dep_time)) |>
  group_by(dep_hour) |>
  summarise(dep_delay = mean(dep_delay)) |>
  ggplot(aes(y = dep_delay, x = dep_hour)) +
  geom_point() +
  geom_smooth()

#sched_dep_hour
flights_dt |>
  mutate(sched_dep_hour = hour(sched_dep_time)) |>
  group_by(sched_dep_hour) |>
  summarise(dep_delay = mean(dep_delay)) |>
  ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
  geom_point() +
  geom_smooth()

-We should use sched_dep_time: grouping by the actual dep_time assigns delayed flights to their later departure hour, which biases the apparent delay pattern toward later in the day.

5.On what day of the week should you leave if you want to minimise the chance of a delay?

flights_dt |>
  mutate(weekday = wday(sched_dep_time, label = TRUE)) |>
  group_by(weekday) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            avg_arr_delay = mean(arr_delay, na.rm = TRUE)) |>
  gather(key = 'delay', value = 'minutes', 2:3) |>
  ggplot() +
  geom_col(mapping = aes(x = weekday, y = minutes, fill = delay),
           position = 'dodge')

-Saturday has the lowest average departure and arrival delays, so it is the best day to leave.

6.What makes the distribution of diamonds$carat and flights$sched_dep_time similar?

#The distribution of diamonds
diamonds |>
  ggplot() +
  geom_freqpoly(mapping = aes(x = carat), binwidth = .04)

#The distribution of flights
flights_dt |>
  mutate(minutes = minute(sched_dep_time)) |>
  ggplot() +
  geom_freqpoly(mapping = aes(x = minutes), binwidth = 1)

-Both distributions spike at "nice" human-chosen numbers: carat values cluster at round values like 0.5 and 1, and scheduled departures cluster at minutes 00 and 30, because humans prefer round numbers.

7.Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.

flights_dt |>
  mutate(delayed = dep_delay > 0,
         minutes = minute(sched_dep_time) %/% 10 * 10,
         minutes = factor(minutes, levels = c(0,10,20,30,40,50),
                          labels = c('0 - 9 mins',
                                     '10 - 19 mins',
                                     '20 - 29 mins',
                                     '30 - 39 mins',
                                     '40 - 49 mins',
                                     '50 - 59 mins'))) |>
  group_by(minutes) |>
  summarize(prop_early = 1 - mean(delayed, na.rm = TRUE)) |>
  ggplot() +
  geom_point(mapping = aes(x = minutes, y = prop_early)) +
  labs(x = 'Scheduled departure (minutes)',
       y = 'Proportion of early departures')

18.4 Time spans

18.4.1 Durations

-A difftime class object records a time span of seconds, minutes, hours, days, or weeks.

18.4.2 Periods

-Periods are time spans but don’t have a fixed length in seconds, instead they work with “human” times, like days and months.

18.4.3 Intervals

We can create an interval by writing start %--% end
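
The three span types side by side (a sketch using a leap year to show the difference):

```r
library(lubridate)

# Durations are exact seconds: dyears(1) adds 365.25 days
ymd("2024-01-01") + dyears(1)   # "2024-12-31 06:00:00 UTC"

# Periods work in human units: years(1) lands on the same calendar date
ymd("2024-01-01") + years(1)    # "2025-01-01"

# Intervals record a start and an end, so division is exact
(ymd("2024-01-01") %--% ymd("2025-01-01")) / days(1)   # 366
```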

18.4.4 Exercises (Q, 4)

1.Explain days(!overnight) and days(overnight) to someone who has just started learning R. What is the key fact you need to know?

-overnight is a boolean variable, and the key fact is that TRUE and FALSE coerce to 1 and 0: days(overnight) is one day for an overnight flight and zero days otherwise, while days(!overnight) is the reverse. Adding days(overnight) to arr_time and sched_arr_time therefore shifts only the overnight flights to the next day.

2.Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.

first_days_2015 <- ymd("2015-01-01") + months(0:11)
first_days_2015
##  [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
##  [6] "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01"
## [11] "2015-11-01" "2015-12-01"
first_days_current <- floor_date(today(), "year") + months(0:11)

3.Write a function that given your birthday (as a date), returns how old you are in years.

howold <- function(d) {
  age <- today() - d
  return(floor(age/dyears(1)))
}

howold(ymd(19980419))
## [1] 25

4.Why can’t (today() %--% (today() + years(1))) / months(1) work?

(today() %--% (today() + years(1))) / months(1)
## [1] 12

-It actually does work here, returning 12. The subtlety is that dividing bare periods (years(1) / months(1)) would be ambiguous because months vary in length; anchoring the span as an interval between two specific dates makes the division well-defined.

18.5 Time zones

-Use Sys.timezone() to find your current time zone.

-OlsonNames() provides all time zones.

-There are two ways to change the time zone:

  1. Keep the instant in time the same, and change how it’s displayed (with_tz()). Use this when the instant is correct, but you want a more natural display.

  2. Change the underlying instant in time (force_tz()). Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.
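
A minimal sketch of the two operations:

```r
library(lubridate)

x <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")

# Same instant, different display
with_tz(x, tzone = "Europe/London")   # "2024-06-01 17:00:00 BST"

# Same clock time, different instant
force_tz(x, tzone = "Europe/London")  # "2024-06-01 12:00:00 BST"
```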

19 Missing values

19.2 Explicit missing values

19.2.1 Last observation carried forward

-When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward)

-We can fill in these missing values with tidyr::fill(). It works like select(), taking a set of columns.
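
A small sketch with hypothetical hand-entered data:

```r
library(tidyr)
library(tibble)

treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          NA,
  "Katherine Burke",  1,          4
)

# fill() carries the last non-missing value down the chosen columns;
# other columns (like response) are left untouched
fill(treatment, person)
```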

19.2.2 Fixed values

-Some times missing values represent some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them.

-If possible, handle this when reading in the data, for example, by using the na argument to readr::read_csv(), e.g., read_csv(path, na = “99”). If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if()
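
For example:

```r
library(dplyr)

x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)    # 1 4 5 7 0  -- replace NA with a fixed, known value

y <- c(1, 4, 5, 7, -99)
na_if(y, -99)     # 1 4 5 7 NA -- turn a sentinel value into NA
```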

19.2.3 NaN

-A NaN (pronounced “nan”), or “not a number”, generally behaves just like NA. In the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).

19.3 Implicit missing values

-An explicit missing value is the presence of an absence.

-An implicit missing value is the absence of a presence.

19.3.1 Pivoting

-Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value.

-By default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting values_drop_na = TRUE.

19.3.2 Complete

-tidyr::complete() allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist.

-Usually call complete() with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data.

-If the range of a variable is correct, but not all values are present, you could use full_seq(x, 1) to generate all values from min(x) to max(x) spaced out by 1.
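
A sketch with a small made-up table:

```r
library(tidyr)
library(tibble)

df <- tibble(
  year = c(2020, 2020, 2021),
  qtr  = c(1, 2, 1),
  n    = c(10, 12, 9)
)

# complete() adds an explicit NA row for the missing (2021, 2) combination
complete(df, year, qtr)

# full_seq() generates every value in the range, spaced by 1
full_seq(c(1, 2, 4), 1)   # 1 2 3 4
```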

19.3.3 Joins

-dplyr::anti_join(x, y) is a particularly useful tool here because it selects only the rows in x that don’t have a match in y.

19.3.4 Exercises

Can you find any relationship between the carrier and the rows that appear to be missing from planes?

missing_planes <- anti_join(flights, planes, by = "tailnum")

missing_planes |>
  group_by(carrier) |>
  summarize(missing_planes = n()) 
## # A tibble: 10 × 2
##    carrier missing_planes
##    <chr>            <int>
##  1 9E                1044
##  2 AA               22558
##  3 B6                 830
##  4 DL                 110
##  5 F9                  50
##  6 FL                 187
##  7 MQ               25397
##  8 UA                1693
##  9 US                 699
## 10 WN                  38

Carriers AA and MQ account for the vast majority of flights whose tail numbers are missing from planes (roughly 22,600 and 25,400 rows respectively); a plausible explanation is that these carriers record tail numbers in a way that doesn't match the planes table.

18.4 Factors and empty groups

-A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors.

-We can use .drop = FALSE to preserve all factor levels.

-All summary functions work with zero-length vectors, but they may return results that are surprising at first glance.

-Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with complete().
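A small sketch of .drop = FALSE with made-up data:

```r
library(dplyr)

health <- tibble(
  name   = c("Ann", "Bo", "Cy"),
  smoker = factor(c("no", "no", "no"), levels = c("yes", "no"))
)

# The empty "yes" level disappears by default...
health |> count(smoker)

# ...but .drop = FALSE preserves it, with n = 0:
health |> count(smoker, .drop = FALSE)
```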

19 Joins

19.2 Keys

19.2.1 Primary and foreign keys

-A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key.

-A foreign key is a variable (or set of variables) that corresponds to a primary key in another table.

19.2.2 Checking primary keys

-One way to do that is to count() the primary keys and look for entries where n is greater than one.

-You should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation!
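For example, verifying that tailnum is a valid primary key for planes:

```r
library(nycflights13)
library(dplyr)

# No key value should occur more than once:
planes |>
  count(tailnum) |>
  filter(n > 1)
## 0 rows: every tailnum is unique

# And no key value should be missing:
planes |>
  filter(is.na(tailnum))
## 0 rows: no missing keys
```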

19.2.3 Surrogate keys

-Surrogate keys can be particularly useful when communicating to other humans: a short row number like "flight 2001" is much easier to refer to than a long compound key.

19.2.4 Exercises

  1. We forgot to draw the relationship between weather and airports in Figure 19.1. What is the relationship and how should it appear in the diagram?
library(nycflights13)
summary(weather)
##     origin               year          month             day       
##  Length:26115       Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Mode  :character   Median :2013   Median : 7.000   Median :16.00  
##                     Mean   :2013   Mean   : 6.504   Mean   :15.68  
##                     3rd Qu.:2013   3rd Qu.: 9.000   3rd Qu.:23.00  
##                     Max.   :2013   Max.   :12.000   Max.   :31.00  
##                                                                    
##       hour            temp             dewp           humid       
##  Min.   : 0.00   Min.   : 10.94   Min.   :-9.94   Min.   : 12.74  
##  1st Qu.: 6.00   1st Qu.: 39.92   1st Qu.:26.06   1st Qu.: 47.05  
##  Median :11.00   Median : 55.40   Median :42.08   Median : 61.79  
##  Mean   :11.49   Mean   : 55.26   Mean   :41.44   Mean   : 62.53  
##  3rd Qu.:17.00   3rd Qu.: 69.98   3rd Qu.:57.92   3rd Qu.: 78.79  
##  Max.   :23.00   Max.   :100.04   Max.   :78.08   Max.   :100.00  
##                  NA's   :1        NA's   :1       NA's   :1       
##     wind_dir       wind_speed         wind_gust         precip        
##  Min.   :  0.0   Min.   :   0.000   Min.   :16.11   Min.   :0.000000  
##  1st Qu.:120.0   1st Qu.:   6.905   1st Qu.:20.71   1st Qu.:0.000000  
##  Median :220.0   Median :  10.357   Median :24.17   Median :0.000000  
##  Mean   :199.8   Mean   :  10.518   Mean   :25.49   Mean   :0.004469  
##  3rd Qu.:290.0   3rd Qu.:  13.809   3rd Qu.:28.77   3rd Qu.:0.000000  
##  Max.   :360.0   Max.   :1048.361   Max.   :66.75   Max.   :1.210000  
##  NA's   :460     NA's   :4          NA's   :20778                     
##     pressure          visib          time_hour                    
##  Min.   : 983.8   Min.   : 0.000   Min.   :2013-01-01 01:00:00.0  
##  1st Qu.:1012.9   1st Qu.:10.000   1st Qu.:2013-04-01 21:30:00.0  
##  Median :1017.6   Median :10.000   Median :2013-07-01 14:00:00.0  
##  Mean   :1017.9   Mean   : 9.255   Mean   :2013-07-01 18:26:37.7  
##  3rd Qu.:1023.0   3rd Qu.:10.000   3rd Qu.:2013-09-30 13:00:00.0  
##  Max.   :1042.1   Max.   :10.000   Max.   :2013-12-30 18:00:00.0  
##  NA's   :2729
summary(airports)
##      faa                name                lat             lon         
##  Length:1458        Length:1458        Min.   :19.72   Min.   :-176.65  
##  Class :character   Class :character   1st Qu.:34.26   1st Qu.:-119.19  
##  Mode  :character   Mode  :character   Median :40.09   Median : -94.66  
##                                        Mean   :41.65   Mean   :-103.39  
##                                        3rd Qu.:45.07   3rd Qu.: -82.52  
##                                        Max.   :72.27   Max.   : 174.11  
##       alt                tz              dst               tzone          
##  Min.   : -54.00   Min.   :-10.000   Length:1458        Length:1458       
##  1st Qu.:  70.25   1st Qu.: -8.000   Class :character   Class :character  
##  Median : 473.00   Median : -6.000   Mode  :character   Mode  :character  
##  Mean   :1001.42   Mean   : -6.519                                        
##  3rd Qu.:1062.50   3rd Qu.: -5.000                                        
##  Max.   :9078.00   Max.   :  8.000
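Rather than eyeballing the summaries, you can check the key relationship directly: weather's origin column is a foreign key referencing airports' primary key faa, so in the diagram weather$origin should connect to airports$faa (a many-to-one relationship, since each origin has many hourly weather records).

```r
library(nycflights13)
library(dplyr)

# Every weather station code should match an airport:
weather |>
  distinct(origin) |>
  anti_join(airports, by = c("origin" = "faa"))
## 0 rows: all three origins (EWR, JFK, LGA) appear in airports$faa
```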
  2. weather only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to flights?

  3. The year, month, day, hour, and origin variables almost form a compound key for weather, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?

  4. We know that some days of the year are special and fewer people than usual fly on them (e.g., Christmas eve and Christmas day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?

  5. Draw a diagram illustrating the connections between the Batting, People, and Salaries data frames in the Lahman package. Draw another diagram that shows the relationship between People, Managers, and AwardsManagers. How would you characterize the relationship between the Batting, Pitching, and Fielding data frames?
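For the duplicate-hour exercise, a quick way to locate the offending observations is to count the would-be key and look for n > 1; in this data the duplicates fall on November 3, 2013, which is likely the repeated clocks-fall-back hour when daylight saving time ended:

```r
library(nycflights13)
library(dplyr)

weather |>
  count(origin, year, month, day, hour) |>
  filter(n > 1)
```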
